Abstract. We study here fundamental issues involved in top-k query evaluation
in probabilistic databases. We consider simple probabilistic databases in
which probabilities are associated with individual tuples, and general probabilistic
databases in which, additionally, exclusivity relationships between tuples can
be represented. In contrast to other recent research in this area, we do not limit
ourselves to injective scoring functions. We formulate three intuitive postulates
for the semantics of top-k queries in probabilistic databases, and introduce a
new semantics, Global-Topk, that satisfies those postulates to a large degree. We
also show how to evaluate queries under the Global-Topk semantics. For simple
databases we design dynamic-programming based algorithms. For general
databases we show polynomial-time reductions to the simple cases, and provide
effective heuristics to speed up the computation in practice. For example, we
demonstrate that for a fixed k the time complexity of top-k query evaluation is
as low as linear, under the assumption that probabilistic databases are simple and
scoring functions are injective.

1 Introduction 

The study of incompleteness and uncertainty in databases has long been an interest 
of the database community [2-8]. Recently, this interest has been rekindled by an in- 
creasing demand for managing rich data, often incomplete and uncertain, emerging 
from scientific data management, sensor data management, data cleaning, information 
extraction etc. [9] focuses on query evaluation in traditional probabilistic databases; 
ULDB [10] supports uncertain data and data lineage in Trio [11]; MayBMS [12] uses 
the vertical World-Set representation of uncertain data [13]. The standard semantics 
adopted in most works is the possible worlds semantics [2, 6, 7, 10, 9, 13]. 

On the other hand, since the seminal papers of Fagin [14, 15], the top-k problem has
been extensively studied in multimedia databases [16], middleware systems [17], data
cleaning [18], core technology in relational databases [19, 20] etc. In the top-k problem,
each tuple is given a score, and users are interested in the k tuples with the highest scores.

More recently, the top-k problem has been studied in probabilistic databases [21-23].
Those papers, however, solve two essentially different top-k problems. Soliman
et al. [21, 22] assume the existence of a scoring function to rank tuples. Probabilities
provide information on how likely tuples are to appear in the database. In contrast,
in [23], the ranking criterion for top-k is the probability associated with each query
answer. In many applications, it is necessary to deal with tuple probabilities and scores at
the same time. Thus, in this paper, we use the model of [21, 22]. Even in this model,
different semantics for top-k queries are possible, so a part of the challenge is to categorize
different semantics.

As a motivating example, let us consider the following graduate admission example. 

Example 1. A graduate admission committee needs to select two winners of a fellowship.
They narrow the candidates down to the following short list:



    Name     Overall Score   Prob. of Coming
    Aidan    0.65            0.3
    Bob      0.55            0.9
    Chris    0.45            0.4



where the overall score is the normalized score of each candidate based on their
qualifications, and the probability of acceptance is derived from historical statistics on
candidates with similar qualifications and background.

The committee wants to make offers to the best two candidates who will take the
offer. This decision problem can be formulated as a top-k query over the above
probabilistic relation, where k = 2.

In Example 1, each tuple is associated with an event, which is that the candidate 
will accept the offer. The probability of the event is shown next to each tuple. In this 
example, all the events of tuples are independent, and tuples are therefore said to be 
independent. Such a relation is said to be simple. In contrast, Example 2 illustrates a
more general case.

Example 2. In a sensor network deployed in a habitat, each sensor reading comes with
a confidence value Prob, which is the probability that the reading is valid. The following
table shows the temperature sensor readings at a given sampling time. These data are
from two sensors, Sensor 1 and Sensor 2, which correspond to two parts of the relation,
marked $C_1$ and $C_2$ respectively. Each sensor has only one true reading at a given time,
therefore tuples from the same part of the relation correspond to exclusive events.



          Temp. °F (Score)   Prob
    C1    22                 0.6
          10                 0.4
    --------------------------------
    C2    25                 0.1
          15                 0.6



Our question is: "What is the temperature of the warmest spot?" 

The question can be formulated as a top-k query, where k = 1, over a probabilistic
relation containing the above data. The scoring function is the temperature. However,
we must take into consideration that the tuples in each part $C_i$, $i = 1, 2$, are exclusive.

Our contributions in this paper are the following:

• We formulate three intuitive semantic postulates and use them to analyze and
categorize different top-k semantics in probabilistic databases (Section 3.1);

• We propose a new semantics for top-k queries in probabilistic databases, called
Global-Topk, which satisfies the above postulates to a large degree (Section 3.2);

• We exhibit polynomial algorithms for evaluating top-k queries under the Global-Topk
semantics in simple probabilistic databases (Section 4.1) and general probabilistic
databases, under injective scoring functions (Section 4.3);

• We generalize Global-Topk semantics to general scoring functions, where ties are
allowed, by introducing the notion of allocation policy. We propose dynamic
programming based algorithms for query evaluation under the Equal allocation policy
(Section 5);

• We provide theoretical time/space analysis for the algorithms proposed. In some
cases, we design efficient heuristics to improve the performance of the basic
algorithms (Section 4.2, Section 4.4). Experiments are carried out to demonstrate the
efficacy of those optimizations (Section 6).

2 Background 

2.1 Probabilistic Relations 

To simplify the discussion in this paper, we assume that a probabilistic database
contains a single probabilistic relation. We refer to a traditional database relation as a
deterministic relation. A deterministic relation $R$ is a set of tuples. A partition $C$ of $R$ is a
collection of non-empty subsets of $R$ such that every tuple belongs to one and only one
of the subsets. That is, $C = \{C_1, C_2, \ldots, C_m\}$ such that $C_1 \cup C_2 \cup \ldots \cup C_m = R$ and
$C_i \cap C_j = \emptyset$, $1 \le i \ne j \le m$. Each subset $C_i$, $i = 1, 2, \ldots, m$, is a part of the partition
$C$. A probabilistic relation $R^p$ has three components: a support (deterministic) relation
$R$, a probability function $p$ and a partition $C$ of the support relation $R$. The probability
function $p$ maps every tuple in $R$ to a probability value in $(0, 1]$. The partition $C$ divides
$R$ into subsets such that the tuples within each subset are exclusive and therefore their
probabilities sum up to at most 1. In the graphical presentation of $R$, we use horizontal
lines to separate tuples from different parts.

Definition 1 (Probabilistic Relation). A probabilistic relation $R^p$ is a triplet $(R, p, C)$,
where $R$ is a support deterministic relation, $p$ is a probability function $p : R \to (0, 1]$
and $C$ is a partition of $R$ such that $\forall C_i \in C$, $\sum_{t \in C_i} p(t) \le 1$.

In addition, we make the assumption that tuples from different parts of $C$ are
independent, and tuples within the same part are exclusive. Definition 1 is equivalent
to the model used in Soliman et al. [21, 22] with exclusive tuple generation rules. Ré et
al. [23] propose a more general model; however, only a restricted model with a fixed
scoring function is used in top-k query evaluation.

Example 2 shows an example of a probabilistic relation whose partition has two
parts. Generally, each part corresponds to a real world entity, in this case, a sensor.
Since there is only one true state of an entity, tuples from the same part are exclusive.
Moreover, the probabilities of all possible states of an entity sum up to at most 1. In
Example 2, the sum of the probabilities of tuples from Sensor 1 is 1, while that from
Sensor 2 is 0.7. This can happen for various reasons. In the above example, we might
encounter a physical difficulty in collecting the sensor data, and end up with partial
data.

Definition 2 (Simple Probabilistic Relation). A probabilistic relation $R^p = (R, p, C)$
is simple iff the partition $C$ contains only singleton sets.

The probabilistic relation in Example 1 is simple (individual parts not illustrated).
Note that in this case, $|R| = |C|$.

We adopt the well-known possible worlds semantics for probabilistic relations
[2, 6, 7, 10, 9, 13].

Definition 3 (Possible World). Given a probabilistic relation $R^p = (R, p, C)$, a
deterministic relation $W$ is a possible world of $R^p$ iff

1. $W$ is a subset of the support relation, i.e., $W \subseteq R$;

2. For every part $C_i$ in the partition $C$, at most one tuple from $C_i$ is in $W$, i.e.,
   $\forall C_i \in C$, $|C_i \cap W| \le 1$;

3. The probability of $W$ (defined by Equation (1)) is positive, i.e., $Pr(W) > 0$.

$$Pr(W) = \prod_{t \in W} p(t) \prod_{C_i \in C'} \Bigl(1 - \sum_{t \in C_i} p(t)\Bigr) \tag{1}$$

where $C' = \{C_i \in C \mid W \cap C_i = \emptyset\}$.

Denote by $pwd(R^p)$ the set of all possible worlds of $R^p$.
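As an illustration of Definition 3, the following minimal sketch enumerates the possible worlds of a probabilistic relation and computes their probabilities according to Equation (1). The function name `possible_worlds` and the encoding of a relation as a list of parts are assumptions made for this example only; the data are the sensor readings of Example 2, with the temperature used as the tuple identifier.

```python
from itertools import product


def possible_worlds(parts):
    """Enumerate (world, probability) pairs of a probabilistic relation.

    `parts` is a list of parts; each part is a list of (tuple_id, prob)
    pairs whose events are mutually exclusive. Different parts are
    independent. This encoding is an assumption of the sketch.
    """
    choices = []
    for part in parts:
        # Either exactly one tuple of the part appears ...
        options = [((tid,), p) for tid, p in part]
        # ... or none does, with the leftover probability mass.
        leftover = 1.0 - sum(p for _, p in part)
        if leftover > 1e-12:  # keep only worlds with positive probability
            options.append(((), leftover))
        choices.append(options)

    worlds = []
    for combo in product(*choices):
        world = frozenset(t for tids, _ in combo for t in tids)
        prob = 1.0
        for _, p in combo:
            prob *= p
        worlds.append((world, prob))
    return worlds


# Example 2: part C1 is Sensor 1, part C2 is Sensor 2.
sensor_parts = [[(22, 0.6), (10, 0.4)], [(25, 0.1), (15, 0.6)]]
worlds = possible_worlds(sensor_parts)
for world, prob in worlds:
    print(sorted(world), round(prob, 3))
```

Because Sensor 1's probabilities sum to 1, every world contains exactly one of its readings, whereas Sensor 2 may contribute no tuple at all.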

2.2 Total Order vs. Weak Order

A binary relation $\succ$ is

- irreflexive: $\forall x.\ x \not\succ x$,

- asymmetric: $\forall x, y.\ x \succ y \Rightarrow y \not\succ x$,

- transitive: $\forall x, y, z.\ (x \succ y \wedge y \succ z) \Rightarrow x \succ z$,

- negatively transitive: $\forall x, y, z.\ (x \not\succ y \wedge y \not\succ z) \Rightarrow x \not\succ z$,

- connected: $\forall x, y.\ x \succ y \vee y \succ x \vee x = y$.

A strict partial order is an irreflexive, transitive (and thus asymmetric) binary
relation. A weak order is a negatively transitive strict partial order. A total order is a
connected strict partial order.

2.3 Scoring function 

A scoring function over a deterministic relation $R$ is a function from $R$ to real numbers,
i.e., $s : R \to \mathbb{R}$. The function $s$ induces a preference relation $\succ_s$ and an indifference
relation $\sim_s$ on $R$. For any two distinct tuples $t_i$ and $t_j$ from $R$,

    $t_i \succ_s t_j$ iff $s(t_i) > s(t_j)$;
    $t_i \sim_s t_j$ iff $s(t_i) = s(t_j)$.

A scoring function over a probabilistic relation $R^p = (R, p, C)$ is a scoring function
$s$ over its support relation $R$. In general, a scoring function establishes a weak order
$\succ_s$ over $R$, where tuples from $R$ can tie in score. However, when the scoring function $s$ is
injective, $\succ_s$ is a total order. In such a case, no two tuples tie in score.

2.4 Top-k Queries

Definition 4 (Top-k Answer Set over a Deterministic Relation). Given a deterministic
relation $R$, a non-negative integer $k$ and a scoring function $s$ over $R$, a top-k answer
set in $R$ under $s$ is a set $T$ of tuples such that

1. $T \subseteq R$;

2. If $|R| < k$, $T = R$, otherwise $|T| = k$;

3. $\forall t \in T\ \forall t' \in R - T.\ t \succ_s t'$ or $t \sim_s t'$.

According to Definition 4, given $k$ and $s$, there can be more than one top-k answer
set in a deterministic relation $R$. The evaluation of a top-k query over $R$ returns one of
them nondeterministically, say $S$. However, if the scoring function $s$ is injective, $S$ is
unique, denoted by $top_{k,s}(R)$.
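The nondeterminism allowed by Definition 4 can be made concrete with a small brute-force sketch. The helper `topk_answer_sets` and the dictionary encoding of a scored relation are hypothetical names introduced here for illustration; the enumeration is exponential and only meant for tiny inputs.

```python
from itertools import combinations


def topk_answer_sets(scores, k):
    """Return every top-k answer set in the sense of Definition 4.

    `scores` maps a tuple identifier to its score (assumed encoding).
    """
    ids = list(scores)
    if len(ids) < k:
        return [frozenset(ids)]  # condition 2: the whole relation
    answers = []
    for candidate in combinations(ids, k):
        chosen = set(candidate)
        rest = [t for t in ids if t not in chosen]
        # Condition 3: every chosen tuple scores at least as high
        # as every tuple that was left out.
        if all(scores[t] >= scores[u] for t in chosen for u in rest):
            answers.append(frozenset(chosen))
    return answers


tied = {"a": 3.0, "b": 2.0, "c": 2.0}      # b and c tie in score
injective = {"a": 3.0, "b": 2.0, "c": 1.0}  # no ties
print(topk_answer_sets(tied, 2))
print(topk_answer_sets(injective, 2))
```

With the tie between `b` and `c` there are two valid top-2 answer sets, while the injective scoring function yields a unique one.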

3 Semantics of Top-k Queries 

In the following two sections, we restrict our discussion to injective scoring functions.
We will discuss the generalization to general scoring functions in Section 5.

3.1 Semantic Postulates for Top-k Answers

Probability opens the gate for various possible semantics for top-k queries. As the
semantics of a probabilistic relation involves a set of worlds, it is to be expected that there
may be more than one top-k answer set, even under an injective scoring function. The
answer to a top-k query over a probabilistic relation $R^p = (R, p, C)$ should clearly be a
set of tuples from its support relation $R$. We formulate below three desirable postulates,
which serve as a benchmark to categorize different semantics.

In the following discussion, denote by $Ans_{k,s}(R^p)$ the collection of all top-k answer
sets of $R^p$ under the function $s$.

Postulates

- Static Postulates

1. Exact k: When $R^p$ is sufficiently large ($|C| \ge k$), the cardinality of every top-k
   answer set $S$ is exactly $k$;

   $$|C| \ge k \Rightarrow [\forall S \in Ans_{k,s}(R^p).\ |S| = k].$$

2. Faithfulness: For every top-k answer set $S$ and any two tuples $t_1, t_2 \in R$, if
   both the score and the probability of $t_1$ are higher than those of $t_2$ and $t_2 \in S$,
   then $t_1 \in S$;

   $$\forall S \in Ans_{k,s}(R^p)\ \forall t_1, t_2 \in R.\ s(t_1) > s(t_2) \wedge p(t_1) > p(t_2) \wedge t_2 \in S \Rightarrow t_1 \in S.$$

- Dynamic Postulate

   $\cup Ans_{k,s}(R^p)$ denotes the union of all top-k answer sets of $R^p = (R, p, C)$
   under the function $s$. For any $t \in R$,

      $t$ is a winner iff $t \in \cup Ans_{k,s}(R^p)$
      $t$ is a loser iff $t \in R - \cup Ans_{k,s}(R^p)$

3. Stability:

   • Raising the score/probability of a winner will not turn it into a loser;

     (a) If a scoring function $s'$ is such that $s'(t) > s(t)$ and for every
         $t' \in R - \{t\}$, $s'(t') = s(t')$, then

         $$t \in \cup Ans_{k,s}(R^p) \Rightarrow t \in \cup Ans_{k,s'}(R^p).$$

     (b) If a probability function $p'$ is such that $p'(t) > p(t)$ and for every
         $t' \in R - \{t\}$, $p'(t') = p(t')$, then

         $$t \in \cup Ans_{k,s}(R^p) \Rightarrow t \in \cup Ans_{k,s}((R^p)'),$$

         where $(R^p)' = (R, p', C)$.

   • Lowering the score/probability of a loser will not turn it into a winner.

     (a) If a scoring function $s'$ is such that $s'(t) < s(t)$ and for every
         $t' \in R - \{t\}$, $s'(t') = s(t')$, then

         $$t \in R - \cup Ans_{k,s}(R^p) \Rightarrow t \in R - \cup Ans_{k,s'}(R^p).$$

     (b) If a probability function $p'$ is such that $p'(t) < p(t)$ and for every
         $t' \in R - \{t\}$, $p'(t') = p(t')$, then

         $$t \in R - \cup Ans_{k,s}(R^p) \Rightarrow t \in R - \cup Ans_{k,s}((R^p)'),$$

         where $(R^p)' = (R, p', C)$.

All of those postulates reflect certain requirements of top-k answers.

Exact k expresses user expectations about the size of the result. Typically, a user
issues a top-k query in order to restrict the size of the result and get a subset of
cardinality k (cf. Example 1). Therefore, k can be a crucial parameter specified by the user
that should be complied with.

Faithfulness reflects the significance of score and probability in a static environment.
It plays an important role in designing efficient query evaluation algorithms. The
satisfaction of Faithfulness admits a set of pruning techniques based on monotonicity.

Stability reflects the significance of score and probability in a dynamic environment.
In a dynamic world, it is common that users update scores/probabilities on-the-fly.
Stability requires that the consequences of such changes should not be counterintuitive.
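The two static postulates can be checked mechanically for a single candidate answer set. The sketch below is only an illustration: the helper names `faithfulness_violations` and `satisfies_exact_k` are hypothetical, tuples are encoded as (name, score, probability) triples, and the data are taken from Example 1. The postulates themselves quantify over all answer sets of a semantics, which this check does not attempt.

```python
def faithfulness_violations(answer, tuples):
    """List pairs (t1, t2) that witness a violation of Faithfulness.

    A pair is a witness when t1 beats t2 on both score and probability,
    t2 is in the answer set, and t1 is not.
    `tuples` is a list of (name, score, prob) triples (assumed encoding).
    """
    violations = []
    for n1, s1, p1 in tuples:
        for n2, s2, p2 in tuples:
            if s1 > s2 and p1 > p2 and n2 in answer and n1 not in answer:
                violations.append((n1, n2))
    return violations


def satisfies_exact_k(answer, num_parts, k):
    """Exact k: if the relation has at least k parts, |answer| must be k."""
    return num_parts < k or len(answer) == k


candidates = [("Aidan", 0.65, 0.3), ("Bob", 0.55, 0.9), ("Chris", 0.45, 0.4)]

print(faithfulness_violations({"Aidan", "Bob"}, candidates))
print(faithfulness_violations({"Aidan", "Chris"}, candidates))
print(satisfies_exact_k({"Bob"}, num_parts=3, k=2))
```

The answer set {Aidan, Chris} is flagged because Bob has both a higher score and a higher probability than Chris, yet is left out.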
3.2 Global-Topk Semantics

We propose here a new top-k answer semantics in probabilistic relations, namely
Global-Topk, which satisfies the postulates formulated in Section 3.1 to a large degree:

• Global-Topk: return the k highest-ranked tuples according to their probability of being
in the top-k answers in possible worlds.

Considering a probabilistic relation $R^p = (R, p, C)$ under an injective scoring
function $s$, any $W \in pwd(R^p)$ has a unique top-k answer set $top_{k,s}(W)$. Each tuple from
the support relation $R$ can be in the top-k answer set (in the sense of Definition 4) in
zero, one or more possible worlds of $R^p$. Therefore, the sum of the probabilities of
those possible worlds provides a global ranking criterion.

Definition 5 (Global-Topk Probability). Assume a probabilistic relation $R^p = (R, p, C)$,
a non-negative integer $k$ and an injective scoring function $s$ over $R^p$. For any tuple $t$ in
$R$, the Global-Topk probability of $t$, denoted by $P^{R^p}_{k,s}(t)$, is the sum of the probabilities
of all possible worlds of $R^p$ whose top-k answer set contains $t$.

$$P^{R^p}_{k,s}(t) = \sum_{\substack{W \in pwd(R^p) \\ t \in top_{k,s}(W)}} Pr(W) \tag{2}$$

For simplicity, we skip the superscript in $P^{R^p}_{k,s}(t)$, i.e., we write $P_{k,s}(t)$, when the context is
unambiguous.

Definition 6 (Global-Topk Answer Set over a Probabilistic Relation). Given a
probabilistic relation $R^p = (R, p, C)$, a non-negative integer $k$ and an injective scoring
function $s$ over $R^p$, a Global-Topk answer set in $R^p$ under $s$ is a set $T$ of tuples such
that

1. $T \subseteq R$;

2. If $|R| < k$, $T = R$, otherwise $|T| = k$;

3. $\forall t \in T, \forall t' \in R - T$, $P_{k,s}(t) \ge P_{k,s}(t')$.

Notice the similarity between Definition 6 and Definition 4. In fact, the probabilistic
version only changes the last condition, which restates the preferred relationship
between two tuples by taking probability into account. This semantics preserves the
nondeterministic nature of Definition 4. For example, if two tuples have the same
Global-Topk probability, and there are k − 1 tuples with a higher Global-Topk
probability, Definition 6 allows one of the two tuples to be added to the top-k answer set
nondeterministically. Example 3 gives an example of the Global-Topk semantics.

Example 3. Consider the top-2 query in Example 1. Clearly, the scoring function here
is the Overall Score function. The following table shows all the possible worlds and
their probabilities. For each world, the last column lists the top-2 answer set of
that world.

    Possible World                Prob     Top-2 answer set
    W1 = ∅                        0.042    ∅
    W2 = {Aidan}                  0.018    {Aidan}
    W3 = {Bob}                    0.378    {Bob}
    W4 = {Chris}                  0.028    {Chris}
    W5 = {Aidan, Bob}             0.162    {Aidan, Bob}
    W6 = {Aidan, Chris}           0.012    {Aidan, Chris}
    W7 = {Bob, Chris}             0.252    {Bob, Chris}
    W8 = {Aidan, Bob, Chris}      0.108    {Aidan, Bob}



Chris is in the top-2 answer of $W_4$, $W_6$, $W_7$, so the top-2 probability of Chris is
0.028 + 0.012 + 0.252 = 0.292. Similarly, the top-2 probabilities of Aidan and Bob
are 0.3 and 0.9 respectively. Since 0.9 > 0.3 > 0.292, Global-Topk will return
{Aidan, Bob}.
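The computation in Example 3 can be reproduced by brute force directly from Definition 5. The following is a minimal sketch, assuming a simple relation encoded as (name, score, probability) triples with an injective scoring function; the function name `global_topk_bruteforce` is hypothetical. It enumerates all $2^n$ worlds, so it is useful only as a reference for small inputs.

```python
from itertools import product


def global_topk_bruteforce(tuples, k):
    """Global-Topk probability of every tuple, by enumerating worlds.

    `tuples` is a list of (name, score, prob) triples describing a simple
    probabilistic relation (independent tuples, distinct scores).
    """
    result = {name: 0.0 for name, _, _ in tuples}
    for present in product([False, True], repeat=len(tuples)):
        prob = 1.0
        world = []
        for (name, score, p), here in zip(tuples, present):
            prob *= p if here else 1.0 - p
            if here:
                world.append((score, name))
        if prob == 0.0:
            continue  # not a possible world (Definition 3, condition 3)
        # The top-k answer set of this world: its k highest-scored tuples.
        for _, name in sorted(world, reverse=True)[:k]:
            result[name] += prob
    return result


candidates = [("Aidan", 0.65, 0.3), ("Bob", 0.55, 0.9), ("Chris", 0.45, 0.4)]
probs = global_topk_bruteforce(candidates, 2)
answer = sorted(probs, key=probs.get, reverse=True)[:2]
print({name: round(v, 3) for name, v in probs.items()})
print(answer)
```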

Note that top-k answer sets may be of cardinality less than k for some possible
worlds. We refer to such possible worlds as small worlds. In Example 3, $W_1, \ldots, W_4$ are all
small worlds.

3.3 Other Semantics 

We present here the most well-established top-k semantics in the literature before 2008
(inclusive).

Soliman et al. [21] propose two semantics for top-k queries in probabilistic
relations.

• U-Topk: return the most probable top-k answer set that belongs to possible world(s);

• U-kRanks: for $i = 1, 2, \ldots, k$, return the most probable $i$-th ranked tuples across all
possible worlds.

Hua et al. [24] independently propose PT-k, a semantics based on Global-Topk
probability as well. PT-k takes an additional parameter: a probability threshold
$p_\tau \in (0, 1]$.

• PT-k: return every tuple whose probability of being in the top-k answers in possible
worlds is at least $p_\tau$.

Example 4. Continuing Example 3, under U-Topk semantics, the probability of the top-2
answer set {Bob} is 0.378, and that of {Aidan, Bob} is 0.162 + 0.108 = 0.27.
Therefore, {Bob} is more probable than {Aidan, Bob} under U-Topk. In fact, {Bob}
is the most probable top-2 answer set in this case, and will be returned by U-Topk.

Under U-kRanks semantics, Aidan is in 1st place in the top-2 answer of $W_2$, $W_5$,
$W_6$, $W_8$, therefore the probability of Aidan being in 1st place in the top-2 answers in
possible worlds is 0.018 + 0.162 + 0.012 + 0.108 = 0.3. However, Aidan is not in
2nd place in the top-2 answer of any possible world, therefore the probability of Aidan
being in 2nd place is 0. In fact, we can construct the following table.

              Aidan    Bob      Chris
    Rank 1    0.3      0.63*    0.028
    Rank 2    0        0.27*    0.264

U-kRanks selects the tuple with the highest probability at each rank (marked with *)
and takes the union of them. In this example, Bob wins at both Rank 1 and Rank 2.
Thus, the top-2 answer returned by U-kRanks is {Bob}.

PT-k returns every tuple with its Global-Topk probability at or above the user-specified
threshold $p_\tau$, therefore the answer depends on $p_\tau$. Say $p_\tau = 0.6$; then PT-k returns
{Bob}, as Bob, with a Global-Top2 probability of 0.9, is the only tuple with a Global-Topk
probability of at least 0.6.
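The three semantics of this section can be compared on the data of Example 1 with a brute-force sketch over possible worlds. The function names `worlds_of`, `u_topk`, `u_kranks` and `pt_k` are hypothetical, the relation is assumed to be simple with an injective scoring function, and none of this reflects the evaluation algorithms of [21] or [24]; it only transcribes the definitions given above.

```python
from collections import defaultdict
from itertools import product


def worlds_of(tuples):
    """Yield (names ranked by decreasing score, probability) per world."""
    for present in product([False, True], repeat=len(tuples)):
        prob = 1.0
        world = []
        for (name, score, p), here in zip(tuples, present):
            prob *= p if here else 1.0 - p
            if here:
                world.append((score, name))
        if prob > 0.0:
            yield [name for _, name in sorted(world, reverse=True)], prob


def u_topk(tuples, k):
    """Most probable top-k answer set across possible worlds."""
    mass = defaultdict(float)
    for ranked, prob in worlds_of(tuples):
        mass[tuple(ranked[:k])] += prob
    return set(max(mass, key=mass.get))


def u_kranks(tuples, k):
    """Union of the most probable tuple at each rank 1..k."""
    rank_mass = [defaultdict(float) for _ in range(k)]
    for ranked, prob in worlds_of(tuples):
        for i, name in enumerate(ranked[:k]):
            rank_mass[i][name] += prob
    return {max(m, key=m.get) for m in rank_mass if m}


def pt_k(tuples, k, threshold):
    """Every tuple whose Global-Topk probability reaches the threshold."""
    mass = defaultdict(float)
    for ranked, prob in worlds_of(tuples):
        for name in ranked[:k]:
            mass[name] += prob
    return {name for name, m in mass.items() if m >= threshold}


candidates = [("Aidan", 0.65, 0.3), ("Bob", 0.55, 0.9), ("Chris", 0.45, 0.4)]
print(u_topk(candidates, 2))
print(u_kranks(candidates, 2))
print(pt_k(candidates, 2, 0.6))
```

All three calls return fewer than two tuples on this input, which is the kind of behavior the Exact k postulate is meant to capture.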

The postulates introduced in Section 3.1 lay the ground for analyzing different
semantics. In Table 1, a single "✓" (resp. "✗") indicates that the postulate is (resp. is not)
satisfied under that semantics. "✓/✗" indicates that the postulate is satisfied by that
semantics in simple probabilistic relations, but not in the general case.

    Semantics      Exact k   Faithfulness   Stability
    Global-Topk    ✓         ✓/✗            ✓
    PT-k           ✗         ✓/✗            ✓
    U-Topk         ✗         ✓/✗            ✓
    U-kRanks       ✗         ✗              ✗

    Table 1. Postulate Satisfaction for Different Semantics



For Exact k, Global-Topk is the only semantics that satisfies this postulate. Example
4 illustrates the case where U-Topk, U-kRanks and PT-k violate this postulate. It is not
satisfied by U-Topk because a small possible world with a high probability could
dominate other worlds. In this case, the dominating possible world might not have enough
tuples. It is also violated by U-kRanks because a single tuple can win at multiple ranks
in U-kRanks. In PT-k, if the threshold parameter $p_\tau$ is set too high, then fewer than k
tuples will be returned (as in Example 4). As $p_\tau$ decreases, PT-k returns more tuples. In the
extreme case when $p_\tau$ approaches 0, any tuple with a positive Global-Topk probability
will be returned.

For Faithfulness, Global-Topk violates it when exclusion rules lead to a highly
restricted distribution of possible worlds, and are combined with an unfavorable scoring
function (see Appendix A (5)). PT-k violates Faithfulness for the same reason (see
Appendix A (6)). U-Topk violates Faithfulness since it requires all tuples in a top-k answer
set to be compatible. This postulate can be violated when a high-score/probability tuple
is dragged down arbitrarily by its compatible tuples which are not very likely to
appear (see Appendix A (7)). U-kRanks violates both Faithfulness and Stability. Under
U-kRanks, instead of a set, a top-k answer is an ordered vector, where ranks are
significant. A change in a tuple's probability/score might have unpredictable consequences
on ranks, therefore those two postulates are not guaranteed to hold (see Appendix A
(8)(12)).

Faithfulness is a postulate which can lead to significant pruning in practice. Even
though it is not fully satisfied by any of the four semantics, some degree of satisfaction
can still be beneficial, as it will help us find pruning rules. For example, our
optimization in Section 4.2 exploits the Faithfulness of Global-Topk in simple probabilistic
databases. Another example: one of the pruning techniques in [24] exploits the
Faithfulness of exclusive tuples in general probabilistic databases as well.
See Appendix A for the proofs of the results in Table 1.

It is worth mentioning here that the intention of Table 1 is to provide a list of
semantic postulates, so that users are able to choose the appropriate postulates for an
application. For example, in government contract bidding, only k companies from the
first round will advance to the second round. The score is inverse to the price offered by
a company, and the probability is the probability that the company will complete the task
on time. The constraint of k is hard, and thus Exact k is a must for the top-k semantics
chosen. In contrast, during college admission, where the score reflects the qualification
of an applicant and the probability is the probability of offer acceptance, while we
intend to have a class of k students, there is usually room for fluctuation. In this case,
Exact k is not a must. It is the same story with Faithfulness and Stability: Faithfulness is
required in applications such as auctions, where the score is the value of an item and the
probability is the availability of the item. In this case, it is natural to aim at the "best
deals", i.e., items with high value and high availability. Stability is a common postulate
required by many dynamic applications. For example, we want to maintain a best k
seller list, where the score is inverse to the price of an item and the probability is its
availability. It is to be expected that a discounted price and improved availability of an
item should not have an adverse influence on the item's standing on the best k seller list.^1

In short, we are not using Table 1 to advertise that a specific semantics is superior or
inferior to any other semantics. Rather, with the help of Table 1, users will be able to
search for the most appropriate semantics based on the right combination of postulates
for their applications.



4 Query Evaluation under Global-Topk

4.1 Simple Probabilistic Relations

We first consider a simple probabilistic relation $R^p = (R, p, C)$ under an injective
scoring function $s$.

Proposition 1. Given a simple probabilistic relation $R^p = (R, p, C)$ and an injective
scoring function $s$ over $R^p$, if $R = \{t_1, t_2, \ldots, t_n\}$ and $t_1 \succ_s t_2 \succ_s \ldots \succ_s t_n$, the
following recursion on Global-Topk queries holds:

$$q(k, i) = \begin{cases}
0 & k = 0 \\
p(t_i) & 1 \le i \le k \\
\Bigl(q(k, i-1)\,\dfrac{\bar{p}(t_{i-1})}{p(t_{i-1})} + q(k-1, i-1)\Bigr)\, p(t_i) & \text{otherwise}
\end{cases} \tag{3}$$

where $q(k, i) = P_{k,s}(t_i)$ and $\bar{p}(t_{i-1}) = 1 - p(t_{i-1})$.

Proof. See Appendix B.

Notice that Equation (3) involves probabilities only, while the scores are used to
determine the order of computation.

^1 In real life, we sometimes observe cases when stability does not hold: a cheaper Wii console
with improved availability does not make it more popular than it was. The reason could be
psychological.

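Example 5. Consider a simple probabilistic relation $R^p = (R, p, C)$, where $R = \{t_1, t_2, t_3, t_4\}$,
$p(t_i) = p_i$, $1 \le i \le 4$, $C = \{\{t_1\}, \{t_2\}, \{t_3\}, \{t_4\}\}$, and an injective scoring
function $s$ such that $t_1 \succ_s t_2 \succ_s t_3 \succ_s t_4$. The following table shows the Global-Topk
probability of $t_i$, where $0 \le k \le 2$, writing $\bar{p}_i = 1 - p_i$.

    k   t_1    t_2           t_3                                  t_4
    0   0      0             0                                    0
    1   $p_1$  $\bar{p}_1 p_2$  $\bar{p}_1 \bar{p}_2 p_3$         $\bar{p}_1 \bar{p}_2 \bar{p}_3 p_4$
    2   $p_1$  $p_2$         $(\bar{p}_2 + \bar{p}_1 p_2) p_3$    $((\bar{p}_2 + \bar{p}_1 p_2)\bar{p}_3 + \bar{p}_1 \bar{p}_2 p_3) p_4$

Row 2 (k = 2) is each $t_i$'s Global-Top2 probability. Now, if we are interested in a
top-2 answer in $R^p$, we only need to pick the two tuples with the highest value in Row
2.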
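The recursion of Equation (3) translates almost line by line into a table-filling routine. The sketch below is an illustrative transcription, not the authors' implementation: the name `global_topk_table` and the 0-based list encoding are assumptions. It checks the closed forms of Example 5 numerically for one arbitrary choice of $p_1, \ldots, p_4$.

```python
def global_topk_table(probs, k):
    """Fill the DP table of Equation (3).

    `probs[i]` is the probability of the (i+1)-th tuple in decreasing
    score order; probabilities are assumed to lie in (0, 1].
    Returns q with q[kk][i] = Global-Top-kk probability of that tuple.
    """
    n = len(probs)
    q = [[0.0] * n for _ in range(k + 1)]  # row 0 stays all zeros
    for kk in range(1, k + 1):
        for i in range(n):
            if i < kk:
                # Among the first kk tuples: in the top-kk whenever present.
                q[kk][i] = probs[i]
            else:
                prev = probs[i - 1]
                q[kk][i] = (q[kk][i - 1] * (1.0 - prev) / prev
                            + q[kk - 1][i - 1]) * probs[i]
    return q


# Numerical check of the closed forms in Example 5.
p1, p2, p3, p4 = 0.5, 0.4, 0.3, 0.2
q = global_topk_table([p1, p2, p3, p4], 2)
row2_t3 = ((1 - p2) + (1 - p1) * p2) * p3
row2_t4 = (((1 - p2) + (1 - p1) * p2) * (1 - p3) + (1 - p1) * (1 - p2) * p3) * p4
print(q[2], row2_t3, row2_t4)
```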

Theorem 1 (Correctness of Algorithm 1). Given a simple probabilistic relation $R^p =
(R, p, C)$, a non-negative integer $k$ and an injective scoring function $s$, Algorithm 1
correctly computes a Global-Topk answer set of $R^p$ under the scoring function $s$.

Proof. Algorithm 1 maintains a priority queue to select the k tuples with the highest
Global-Topk value. Notice that the nondeterminism is reflected in Line 6 in the
algorithm for maintaining the priority queue in the presence of tying elements. As long as
Line 2 in Algorithm 1 correctly computes the Global-Topk probability of each tuple in
$R$, Algorithm 1 returns a valid Global-Topk answer set. By Proposition 1, Algorithm 2
correctly computes the Global-Topk probability of tuples in $R$.

Algorithm 1 is a one-pass computation over the probabilistic relation, which can
be easily implemented even if secondary storage is used. The overhead is the initial
sorting cost (not shown in Algorithm 1), which would be amortized by the workload of
consecutive top-k queries.

Algorithm 2 takes $O(kn)$ time to compute the dynamic programming (DP) table. In
addition, Algorithm 1 uses a priority queue to maintain the k highest values, which takes
$O(n \log k)$. Altogether, Algorithm 1 takes $O(kn + n \log k) = O(kn)$.

The major space use in Algorithm 1 is the bookkeeping of the DP table in Line 2
(Algorithm 2). A straightforward implementation of Algorithm 1 and Algorithm 2 takes
$O(kn)$ space. However, notice that in Algorithm 2, the column $q(0 \ldots k, i)$ depends
on the column $q(0 \ldots k, i-1)$ only, and for the column $q(0 \ldots k, i-1)$, only the $k$-th
value $q(k, i-1)$ will be used in updating the priority queue in Line 4 of Algorithm 1
later. Therefore, in practice, we can reduce the space complexity to $O(k)$ by moving
the update of the priority queue in Algorithm 1 to Algorithm 2, and using a vector of
size $k + 1$ to keep track of the previous column in the DP table. To be more specific, in
Algorithm 2, each time we finish computing the current column based on the previous
column in the DP table, we add the $k$-th value in the current column to the priority
queue and update the previous column with the current column. For readability, we
present here the original algorithms without this optimization for space.
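The space optimization described above can be sketched as follows. This is one possible reading of the paragraph rather than the authors' code: the function name `global_topk`, the (name, score, probability) encoding, and the use of Python's `heapq` as the fixed-cardinality priority queue are all assumptions. Only the previous DP column and a heap of at most k entries are kept, so the working space is $O(k)$ besides the sorted input.

```python
import heapq


def global_topk(tuples, k):
    """Global-Topk answer for a simple relation, keeping one DP column.

    `tuples` is a list of (name, score, prob) triples with distinct scores
    and probabilities in (0, 1]. Returns (probability, name) pairs, highest
    Global-Topk probability first.
    """
    if k <= 0:
        return []
    ordered = sorted(tuples, key=lambda t: t[1], reverse=True)
    heap = []            # min-heap on Global-Topk probability, size <= k
    prev_col = None      # column q(0..k, i-1)
    prev_p = None        # p(t_{i-1})
    for i, (name, _, p) in enumerate(ordered):
        col = [0.0] * (k + 1)          # col[0] = q(0, i) = 0
        for kk in range(1, k + 1):
            if i < kk:
                col[kk] = p
            else:
                col[kk] = (prev_col[kk] * (1.0 - prev_p) / prev_p
                           + prev_col[kk - 1]) * p
        # Only the k-th entry of the column feeds the priority queue.
        heapq.heappush(heap, (col[k], name))
        if len(heap) > k:
            heapq.heappop(heap)        # drop the smallest probability
        prev_col, prev_p = col, p
    return sorted(heap, reverse=True)


candidates = [("Aidan", 0.65, 0.3), ("Bob", 0.55, 0.9), ("Chris", 0.45, 0.4)]
top2 = global_topk(candidates, 2)
print(top2)
```

Ties in probability are broken by the heap's comparison of names here, which is one admissible instance of the nondeterministic choice permitted by Definition 6.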
Algorithm 1 (Ind_Topk) Evaluate Global-Topk Queries in a Simple Probabilistic
Relation under an Injective Scoring Function
Require: $R^p = (R, p, C)$, $k$
Ensure: tuples in $R$ are sorted in the decreasing order based on the scoring function $s$
1: Initialize a fixed cardinality $(k + 1)$ priority queue $Ans$ of $(t, prob)$ pairs, which compares
   pairs on $prob$, i.e., the Global-Topk probability of $t$;
2: Calculate Global-Topk probabilities using Algorithm 2, i.e.,
   $q(0 \ldots k, 1 \ldots |R|)$ = Ind_Topk_Sub($R^p$, $k$);
3: for $i = 1$ to $|R|$ do
4:   Add $(t_i, q(k, i))$ to $Ans$;
5:   if $|Ans| > k$ then
6:     remove the pair with the smallest $prob$ value from $Ans$;
7:   end if
8: end for
9: return $\{t_i \mid (t_i, q(k, i)) \in Ans\}$;



Algorithm 2 (Ind_Topk_Sub) Compute Global-Topk Probabilities in a Simple
Probabilistic Relation under an Injective Scoring Function
Require: $R^p = (R, p, C)$, $k$
Ensure: tuples in $R$ are sorted in the decreasing order based on the scoring function $s$
1: $q(0, 1) = 0$;
2: for $k' = 1$ to $k$ do
3:   $q(k', 1) = p(t_1)$;
4: end for
5: for $i = 2$ to $|R|$ do
6:   for $k' = 0$ to $k$ do
7:     if $k' = 0$ then
8:       $q(k', i) = 0$;
9:     else
10:      $q(k', i) = p(t_i)\bigl(q(k', i-1)\,\frac{\bar{p}(t_{i-1})}{p(t_{i-1})} + q(k'-1, i-1)\bigr)$;
11:    end if
12:  end for
13: end for
14: return $q(0 \ldots k, 1 \ldots |R|)$;
4.2 Threshold Algorithm Optimization

Fagin [15] proposes the Threshold Algorithm (TA) for processing top-k queries in a
middleware scenario. In a middleware system, an object has m attributes. For each attribute,
there is a sorted list ranking objects in the decreasing order of their score on that attribute.
An aggregation function $f$ combines the individual attribute scores $x_i$, $i = 1, 2, \ldots, m$,
to obtain the overall object score $f(x_1, x_2, \ldots, x_m)$. An aggregation function is
monotonic iff $f(x_1, x_2, \ldots, x_m) \le f(x'_1, x'_2, \ldots, x'_m)$ whenever $x_i \le x'_i$ for every $i$. Fagin
[15] shows that TA is cost-optimal in finding the top-k objects in such a system.

Denote by $T$ and $P$ the lists of tuples in the decreasing order of score and probability
respectively. Following the convention in [15], $\underline{t}$ and $\underline{p}$ are the last values seen in $T$ and
$P$ respectively.

Algorithm 1^TA (TA_Ind_Topk)

(1) Go down the $T$ list, and fill in entries in the DP table. Specifically, for $\underline{t} = t_j$,
    compute the entries in the $j$-th column up to the $k$-th row. Add $t_j$ to the
    top-k answer set $Ans$, if any of the following conditions holds:

    (a) $Ans$ has fewer than $k$ tuples, i.e., $|Ans| < k$;

    (b) The Global-Topk probability of $t_j$, i.e., $q(k, j)$, is greater than
        the lower bound of $Ans$, i.e., $LB_{Ans}$, where
        $LB_{Ans} = \min_{t_i \in Ans} q(k, i)$.

    In the second case, we also need to drop the tuple with the lowest Global-Topk
    probability in order to preserve the cardinality of $Ans$.

(2) After we have seen at least $k$ tuples in $T$, we go down the $P$ list to find
    the first $p$ whose tuple $t$ has not been seen. Let $\underline{p} = p$; we can use
    $\underline{p}$ to estimate the threshold, i.e., the upper bound ($UP$) of the Global-Topk
    probability of any unseen tuple. Assume $\underline{t} = t_i$,

    $$UP = \Bigl(q(k, i)\,\frac{\bar{p}(t_i)}{p(t_i)} + q(k-1, i)\Bigr)\,\underline{p}.$$

(3) If $UP > LB_{Ans}$, $Ans$ might be updated in the future, so go back to (1).
    Otherwise, we can safely stop and report $Ans$.
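Steps (1) to (3) can be sketched in Python as follows. This is a hedged illustration only: the function name `ta_global_topk`, the in-memory sorted lists standing in for the $T$ and $P$ index scans, the assumption that tuple names are unique, and the extra return value counting the DP columns actually computed are all choices made for this example.

```python
import heapq


def ta_global_topk(tuples, k):
    """TA-guided Global-Topk evaluation over a simple relation (sketch).

    `tuples` is a list of (name, score, prob) triples with unique names,
    distinct scores and probabilities in (0, 1].
    Returns (answer, columns): the answer as (probability, name) pairs and
    the number of DP columns computed before stopping.
    """
    if k <= 0 or not tuples:
        return [], 0
    by_score = sorted(tuples, key=lambda t: t[1], reverse=True)  # list T
    by_prob = sorted(tuples, key=lambda t: t[2], reverse=True)   # list P
    seen = set()
    heap = []              # Ans as a min-heap; heap[0][0] is LB_Ans
    prev_col, prev_p = None, None
    p_pos = 0              # cursor into P
    columns = 0
    for i, (name, _, p) in enumerate(by_score):
        # Step (1): compute the next DP column and update Ans.
        col = [0.0] * (k + 1)
        for kk in range(1, k + 1):
            if i < kk:
                col[kk] = p
            else:
                col[kk] = (prev_col[kk] * (1.0 - prev_p) / prev_p
                           + prev_col[kk - 1]) * p
        columns += 1
        seen.add(name)
        if len(heap) < k:
            heapq.heappush(heap, (col[k], name))
        elif col[k] > heap[0][0]:
            heapq.heapreplace(heap, (col[k], name))
        prev_col, prev_p = col, p
        # Step (2): bound the probability of any unseen tuple.
        if len(seen) >= k:
            while p_pos < len(by_prob) and by_prob[p_pos][0] in seen:
                p_pos += 1
            if p_pos == len(by_prob):
                break      # nothing left unseen
            p_bar = by_prob[p_pos][2]
            upper = (col[k] * (1.0 - p) / p + col[k - 1]) * p_bar
            # Step (3): stop once no unseen tuple can enter Ans.
            if upper <= heap[0][0]:
                break
    return sorted(heap, reverse=True), columns


items = [("a", 10, 0.9), ("b", 9, 0.8), ("c", 8, 0.7),
         ("d", 7, 0.1), ("e", 6, 0.05)]
answer, columns = ta_global_topk(items, 2)
print(answer, columns)
```

On this input, where high scores coincide with high probabilities, the scan stops after two of the five columns.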



Theorem 2 (Correctness of Algorithm 1^TA). Given a simple probabilistic relation
$R^p = (R, p, C)$, a non-negative integer $k$ and an injective scoring function $s$ over $R^p$,
the above TA-based algorithm correctly finds a Global-Topk answer set.

Proof. See Appendix B.

The optimization above aims at an early stop. Bruno et al. [25] carry out an extensive
experimental study on the effectiveness of applying TA in RDBMSs. They consider
various aspects of query processing. One of their conclusions is that if at least one
of the indices available for the attributes (*) is a covering index, that is, it is
defined over all other attributes and we can get the values of all other attributes
directly without performing a primary index lookup, then the improvement by TA can be
up to two orders of magnitude. The cost of building a useful set of indices once would
be amortized by a large number of top-k queries that subsequently benefit from such
indices. Even in the absence of covering indices, if the data is highly correlated,
which in our case means that high-score tuples have high probabilities, TA would still
be effective.

(*) Probability is typically supported as a special attribute in a DBMS.

TA is guaranteed to work as long as the aggregation function is monotonic. For a
simple probabilistic relation, if we regard score and probability as two special
attributes, the Global-Topk probability $P_{k,s}$ is an aggregation function of score
and probability. The Faithfulness postulate in Section 3.1 implies the monotonicity of
Global-Topk probability in simple probabilistic relations. Consequently, assuming that
we have an index on probability as well, we can guide the dynamic programming (DP) in
Algorithm 2 by TA. Now, instead of computing all $kn$ entries for DP, where $n = |R|$,
the algorithm can be stopped as early as possible. A subtlety is that the Global-Topk
probability $P_{k,s}$ is only well-defined for $t \in R$, unlike in [15], where an
aggregation function is well-defined over the domain of all possible attribute values.
Therefore, compared to the original TA, we need to achieve the same behavior without
referring to virtual tuples which are not in R.

U-Topk satisfies Faithfulness in simple probabilistic relations. An adaptation of the
TA algorithm in this case is available in [22]. TA is not applicable to U-kRanks. Even
though we can define an aggregation function per rank, rank = 1, 2, ..., k, for tuples
under U-kRanks, the violation of Faithfulness in Table 1 suggests a violation of
monotonicity of those k aggregation functions. PT-k computes Global-Topk probabilities
as well, and is therefore a natural candidate for TA in simple probabilistic relations.

4.3 Arbitrary Probabilistic Relations 

Induced Event Relation In the general case of probabilistic relations (Definition 1),
each part of the partition C can contain more than one tuple. The crucial independence
assumption in Algorithm 1 no longer holds. However, even though tuples from one part
of the partition C are not independent, tuples from different parts are. In the
following definition, we assume an identifier function id. For any tuple t, id(t)
identifies the part to which t belongs.

Definition 7 (Induced Event Relation). Given a probabilistic relation $R^p = (R, p, C)$,
an injective scoring function s over $R^p$ and a tuple $t \in C_{id(t)} \in C$, the
event relation induced by t, denoted by $E^p = (E, p^E, C^E)$, is a probabilistic
relation whose support relation E has only one attribute, Event. The relation E and the
probability function $p^E$ are defined by the following two generation rules:

- Rule 1: $t_{e_t} \in E$ and $p^E(t_{e_t}) = p(t)$;

- Rule 2: $\forall C_i \in C \wedge C_i \ne C_{id(t)}.\ (\exists t' \in C_i \wedge t' \succ_s t) \Rightarrow (t_{e_{C_i}} \in E)$
  and $p^E(t_{e_{C_i}}) = \sum_{t' \in C_i,\ t' \succ_s t} p(t')$.

No other tuples belong to E. The partition $C^E$ is defined as the collection of
singleton subsets of E.






Except for one special tuple generated by Rule 1, each tuple in the induced event
relation (generated by Rule 2) represents an event $e_{C_i}$ associated with a part
$C_i \in C$. Given the tuple t, the event $e_{C_i}$ is defined as "there is a tuple
from the part $C_i$ with a score higher than that of t". The probability of this event,
denoted by $p(t_{e_{C_i}})$, is the probability that $e_{C_i}$ occurs.

The role of the special tuple $t_{e_t}$ and its probability p(t) will become clear in
Proposition 3. Let us first look at an example of an induced event relation.

Example 6. Given $R^p$ as in Example 2, we would like to construct the induced event
relation $E^p = (E, p^E, C^E)$ for tuple t = (Temp: 15) from $C_2$. By Rule 1, we have
$t_{e_t} \in E$, $p^E(t_{e_t}) = 0.6$. By Rule 2, since $t \in C_2$, we have
$t_{e_{C_1}} \in E$ and
$p^E(t_{e_{C_1}}) = \sum_{t' \in C_1,\ t' \succ_s t} p(t') = p((Temp: 22)) = 0.6$.
Therefore,

    E: Event          p^E: Prob
    $t_{e_{C_1}}$     0.6
    $t_{e_t}$         0.6



Proposition 2. An induced event relation in Definition 7 is a simple probabilistic rela- 
tion. 
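
As an illustration of Rules 1 and 2, the following Python sketch builds the induced event relation for one tuple. The function name, the dictionary layout of the relation, and the event names `e_C<i>` and `e_t<id>` are assumptions made for this example only.

```python
from collections import defaultdict

def induced_event_relation(relation, t_id):
    """Build the event relation induced by tuple t_id (Rules 1 and 2).

    relation: dict tuple id -> (part id, score, probability); scores distinct.
    Returns a list of (event name, probability); the Rule 1 tuple comes last.
    """
    part, score, prob = relation[t_id]
    higher = defaultdict(float)
    for o_part, o_score, o_prob in relation.values():
        # Rule 2 only considers other parts and tuples that score above t.
        if o_part != part and o_score > score:
            higher[o_part] += o_prob
    events = [(f"e_C{c}", p) for c, p in sorted(higher.items()) if p > 0]
    events.append((f"e_t{t_id}", prob))           # Rule 1
    return events
```

Every returned event stands alone in its own part, which mirrors the statement that the induced event relation is simple.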

Evaluating Global-Topk Queries With the help of induced event relations, we can
reduce Global-Topk in the general case to Global-Topk in simple probabilistic relations.

Lemma 1. Let $R^p = (R, p, C)$ be a probabilistic relation, s an injective scoring
function, $t \in R$, and $E^p = (E, p^E, C^E)$ the event relation induced by t. Define
$Q^p = (E - \{t_{e_t}\}, p^E, C^E - \{\{t_{e_t}\}\})$. Then, the Global-Topk probability
of t satisfies the following:

$P^{R^p}_{k,s}(t) = p(t) \cdot \sum_{W_e \in pwd(Q^p),\ |W_e| < k} Pr(W_e).$

Proof. See Appendix B. 

Proposition 3. Given a probabilistic relation $R^p = (R, p, C)$ and an injective
scoring function s, for any $t \in R^p$, the Global-Topk probability of t equals the
Global-Topk probability of $t_{e_t}$ when evaluating top-k in the induced event
relation $E^p = (E, p^E, C^E)$ under the injective scoring function
$s^E : E \to \mathbb{R}$, $s^E(t_{e_t}) = \frac{1}{2}$ and $s^E(t_{e_{C_i}}) = i$:

$P^{R^p}_{k,s}(t) = P^{E^p}_{k,s^E}(t_{e_t}).$

Proof. See Appendix B. 

In Proposition 3, the choice of the function $s^E$ is rather arbitrary. In fact, any
injective function giving $t_{e_t}$ the lowest score will do. Every tuple other than
$t_{e_t}$ in the induced event relation corresponds to an event that a tuple with a
score higher than that of t occurs. We want to track the case that at most k - 1 such
events happen. Since any induced event relation is simple (Proposition 2),
Proposition 3 illustrates how we can reduce the computation of $P^{R^p}_{k,s}(t)$ in
the original probabilistic relation to a top-k computation in a simple probabilistic
relation, where we can apply the DP technique described in Section 4.1. The complete
algorithms are shown as Algorithm 3 and Algorithm 4.



Algorithm 3 (IndEx_Topk) Evaluate Global-Topk Queries in a General Probabilistic
Relation under an Injective Scoring Function

Require: $R^p = (R, p, C)$, k, s
1: Initialize a fixed cardinality k + 1 priority queue Ans of (t, prob) pairs, which
   compares pairs on prob, i.e., the Global-Topk probability of t;
2: for $t \in R$ do
3:   Calculate $P^{R^p}_{k,s}(t)$ using Algorithm 4, i.e.,
     $P^{R^p}_{k,s}(t)$ = IndEx_Topk_Sub($R^p$, k, s, t);
4:   Add (t, $P^{R^p}_{k,s}(t)$) to Ans;
5:   if |Ans| > k then
6:     remove the pair with the smallest prob value from Ans;
7:   end if
8: end for
9: return {t | (t, $P^{R^p}_{k,s}(t)$) $\in$ Ans};



In Algorithm 4, we first find the part $C_{id(t)}$ where t belongs. In Line 2, we
initialize the support relation E of the induced event relation with the tuple
generated by Rule 1 in Definition 7. For any part $C_i$ other than $C_{id(t)}$, we
compute the probability of the event $e_{C_i}$ according to Definition 7 (Line 4), and
add it to E if its probability is non-zero (Lines 5-7). Since all tuples from the same
part are exclusive, this probability is the sum of the probabilities of all qualifying
tuples in that part. If no tuple from $C_i$ qualifies, this probability is zero. In
this case, we do not care whether any tuple from $C_i$ will be in the possible world or
not, since it does not have any influence on whether t will be in top-k or not. The
corresponding event tuple is therefore excluded from E. Note that, by default, any
probabilistic database assumes that any tuple not in the support relation has
probability zero. Line 9 uses Algorithm 2 to compute $P^{E^p}_{k,s^E}(t_{e_t})$. Note
that Algorithm 2 requires all tuples to be sorted on score. Since we already know the
scoring function $s^E$, we simply need to organize tuples based on $s^E$ when
generating E. No extra sorting is necessary.

Theorem 3 (Correctness of Algorithm 3). Given a probabilistic relation
$R^p = (R, p, C)$, a non-negative integer k and an injective scoring function s,
Algorithm 3 correctly computes a Global-Topk answer set of $R^p$ under the scoring
function s.

Proof. The top-level structure of Algorithm 3 resembles that of Algorithm 1. Therefore,
as long as Line 3 in Algorithm 3 correctly computes the Global-Topk probability of each
tuple in R, Algorithm 3 returns a valid Global-Topk answer set. Lines 1-8 in Algorithm
4 compute the event relation induced by the tuple t. By Proposition 3, Lines 9-10 in
Algorithm 4 correctly compute the Global-Topk probability of t.

Algorithm 4 (IndEx_Topk_Sub) Calculate $P^{R^p}_{k,s}(t)$ using an induced event relation

Require: $R^p = (R, p, C)$, k, s, $t \in R$
1: Find the part $C_{id(t)} \in C$ such that $t \in C_{id(t)}$;
2: $E = \{t_{e_t}\}$, where $p^E(t_{e_t}) = p(t)$;
3: for $C_i \in C$ and $C_i \ne C_{id(t)}$ do
4:   $p(e_{C_i}) = \sum_{t' \in C_i,\ t' \succ_s t} p(t')$;
5:   if $p(e_{C_i}) > 0$ then
6:     $E = E \cup \{t_{e_{C_i}}\}$, where $p^E(t_{e_{C_i}}) = p(e_{C_i})$;
7:   end if
8: end for
9: Use Algorithm 2 to compute Global-Topk probabilities in $E^p = (E, p^E, C^E)$, i.e.,
   q(0 ... k, 1 ... |E|) = Ind_Topk_Sub($E^p$, k)
10: $P^{R^p}_{k,s}(t) = P^{E^p}_{k,s^E}(t_{e_t}) = q(k, |E|)$;
11: return $P^{R^p}_{k,s}(t)$;
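
Algorithms 3 and 4 can be sketched together in Python as below. This is an illustrative sketch under assumptions: the names `ind_ex_topk` and `ind_ex_topk_sub` and the dictionary layout are hypothetical, and the call to Algorithm 2 is replaced by a standard DP over the number of events that occur, which yields the same quantity q(k, |E|).

```python
import heapq
from collections import defaultdict

def ind_ex_topk_sub(relation, k, t_id):
    """Global-Topk probability of one tuple via its induced event relation.

    relation: dict tuple id -> (part id, score, probability); scores distinct.
    """
    if k <= 0:
        return 0.0
    part, score, prob = relation[t_id]
    event_prob = defaultdict(float)               # Lines 3-8: one event per part
    for o_part, o_score, o_prob in relation.values():
        if o_part != part and o_score > score:
            event_prob[o_part] += o_prob
    # exact[c] = probability that exactly c of the events occur, for c < k
    exact = [1.0] + [0.0] * (k - 1)
    for p in event_prob.values():
        for c in range(k - 1, 0, -1):
            exact[c] = exact[c] * (1 - p) + exact[c - 1] * p
        exact[0] *= 1 - p
    return prob * sum(exact)                      # t appears, fewer than k events

def ind_ex_topk(relation, k):
    """Return the k tuple ids with the highest Global-Topk probability."""
    ans = []                                      # min-heap of (probability, id)
    for t_id in relation:
        heapq.heappush(ans, (ind_ex_topk_sub(relation, k, t_id), t_id))
        if len(ans) > k:
            heapq.heappop(ans)                    # drop the smallest probability
    return [t for _, t in sorted(ans, reverse=True)]
```

Keeping only counts below k is enough, because the tuple is in the top-k exactly when it appears and fewer than k higher-scored events occur.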



In Algorithm 4, Lines 3-8 take $O(n)$ time to build E (we need to scan all tuples
within each part). The call to Algorithm 2 in Line 9 takes $O(k|E|)$, where |E| is no
more than the number of parts in partition C, which is in turn no more than n. So
Algorithm 4 takes $O(kn)$. Algorithm 3 makes n calls to Algorithm 4 to compute
$P^{R^p}_{k,s}(t)$ for every tuple $t \in R$. Again, Algorithm 3 uses a priority queue
to select the final answer set, which takes $O(n \log k)$. The entire algorithm takes
$O(kn^2 + n \log k) = O(kn^2)$.

A straightforward implementation of Algorithm 3 and Algorithm 4 takes $O(kn)$
space, as the call to Algorithm 2 in Algorithm 4 could take up to $O(k|E|)$ space.
However, by using a spatially optimized version of Algorithm 2 mentioned in Section
4.1, this DP table computation in Algorithm 4 can be completed in $O(k)$ space.
Algorithm 4 still needs $O(|E|)$ space to store the induced event relation computed
in Lines 3-8. As |E| has an upper bound n, the total space is therefore $O(k + n)$.

4.4 Optimizations for Arbitrary Probabilistic Relations 

In the previous section, we presented the basic algorithms to compute Global-Topk
probabilities in general probabilistic relations. In this section, we provide two
heuristics, Rollback and RollbackSort, to speed up this computation. Our optimizations
are similar to prefix sharing optimizations in [24], although the assumptions and
technical details are different. In our terminology, the aggressive and lazy prefix
sharing in [24] assume the ability to "look ahead" in the input tuple stream to locate
the next tuple belonging to every part. In contrast, Rollback assumes no extra
information, and RollbackSort assumes the availability of aggregate statistics on
tuples.

Rollback and RollbackSort take advantage of the following two facts in the basic
algorithms:

Fact 1: The overlap of the event relations induced by consecutive tuples;
Rollback and RollbackSort are based on the following "incremental" computation
of induced event relations for tuples in R. By Definition 7, for any tuple $t \in R$,
only tuples with a higher score will have an influence on t's induced event relation.
Given a scoring function s, consider two adjacent tuples $t_i$, $t_{i+1}$ in the
decreasing order of scores. Denote by $E_i$ and $E_{i+1}$ their induced event relations
under the function s respectively.
Case 1: $t_i$ and $t_{i+1}$ are exclusive.

Then $t_i$ and $t_{i+1}$ have the same induced event relation except for the one tuple
generated by Rule 1 in each induced event relation.

$E_i - \{t_{e_{t_i}}\} = E_{i+1} - \{t_{e_{t_{i+1}}}\}.$ (4)

Case 2: $t_i$ and $t_{i+1}$ are independent, and $t_{i+1}$ is independent of
$t_1, \ldots, t_{i-1}$ as well.

Recall that any tuple $t_j \in C_{id(t_i)}$, $1 \le j \le i - 1$, where $C_{id(t_i)}$ is
the part containing $t_i$, does not contribute to $E_i$ due to the existence of
$t_{e_{t_i}}$ in $E_i$. Tuple $t_{i+1}$ is independent of such tuple $t_j$. In
$E_{i+1}$, instead of $t_{e_{t_i}}$, there is an event tuple $t_{e_{C_{id(t_i)}}}$,
which corresponds to the event that one tuple from $C_{id(t_i)}$ appears. The second
condition guarantees that there is no tuple in $E_i - \{t_{e_{t_i}}\}$ which is
incompatible with the event tuple $t_{e_{t_{i+1}}}$ generated by Rule 1 in $E_{i+1}$.
Therefore, all event tuples in $E_i - \{t_{e_{t_i}}\}$ should be retained in $E_{i+1}$.
Consequently,

$E_i - \{t_{e_{t_i}}\} = E_{i+1} - \{t_{e_{C_{id(t_i)}}}, t_{e_{t_{i+1}}}\}.$ (5)

Case 3: $t_i$ and $t_{i+1}$ are independent, and $t_{i+1}$ is incompatible with at
least one tuple from $t_1, \ldots, t_{i-1}$.

In this case, like in Case 2, the first condition guarantees the existence of
$t_{e_{C_{id(t_i)}}}$ in $E_{i+1}$. However, the second condition essentially states
that some tuple from $C_{id(t_{i+1})}$ has a score higher than that of $t_i$. Thus,
there is an event tuple $t_{e_{C_{id(t_{i+1})}}}$ in $E_i$, which is incompatible with
$t_{e_{t_{i+1}}}$ generated by Rule 1 in $E_{i+1}$. As a result, besides the one tuple
generated by Rule 1 in each induced event relation, $E_{i+1}$ and $E_i$ also differ in
the event tuples $t_{e_{C_{id(t_{i+1})}}}$ and $t_{e_{C_{id(t_i)}}}$.

$E_i - \{t_{e_{t_i}}, t_{e_{C_{id(t_{i+1})}}}\} = E_{i+1} - \{t_{e_{C_{id(t_i)}}}, t_{e_{t_{i+1}}}\}.$ (6)

Fact 2: The arbitrary choice of the scoring function in Proposition 3.

As we can see from Proposition 3, the event tuple $t_{e_t}$ has the same Global-Topk
probability in the induced event relation under two distinct scoring functions as
long as they both give $t_{e_t}$ the lowest score.



Rollback In Rollback, we use an annotated $(k + 1) \times n$ table $T^a$ to support two
major operations for each induced event relation: (1) the creation of the induced event
relation, and (2) the computation in the dynamic programming (DP) table to calculate the
Global-Topk probability of the tuple inducing it. Each column in $T^a$ is annotated with
(part_id, prob) of an event tuple in the current induced event relation. Each entry
(row) in the column corresponds to an entry in the DP table when calculating the
Global-Topk probabilities.

By Fact 1, it is clear that the creation of induced event relations is incremental if
we do it for tuples in the decreasing order of scores. Fortunately, the decreasing order
of scores is also used in computing the Global-Topk probability in each induced event
relation. Rollback exploits this alignment in order and piggybacks the creation of the
induced event relation onto the computation in the DP table.

By Fact 2, we can reuse the scoring function to the greatest extent between two 
consecutive induced event relations, and therefore avoid the recomputation of a part of 
the DP table. 

Without loss of generality, assume $t_1 \succ_s t_2 \succ_s \cdots \succ_s t_n$, and the
tuple just processed is $t_i$, $1 \le i \le n$. By "processed", we mean that there is a
DP table for computing the Global-Topk probability, denoted by $DP_i$, where each column
is associated with an event tuple in $t_i$'s induced event relation $E_i$. Assume
$|E_i| = l_i$, then there are $l_i$ columns in $DP_i$. $l_i \le i$, since only
$t_1, t_2, \ldots, t_i$ can contribute to $E_i$. In fact, $l_i = i$ when all i tuples
are independent. In this case, each tuple corresponds to a distinct event tuple in
$E_i$. When there are exclusive tuples, $l_i < i$, because in this case, if a tuple from
$t_1, t_2, \ldots, t_{i-1}$ is incompatible with $t_i$, it is ignored due to the
existence of $t_{e_{t_i}}$ in $E_i$. For other exclusive tuples, the tuples from the
same part collapse into a single event tuple in $E_i$. Moreover, the probability of such
an event tuple is the sum of the probabilities of all exclusive tuples contributing to
it.

Now, consider the next tuple to be processed, $t_{i+1}$, its induced event relation
$E_{i+1}$ and the DP table $DP_{i+1}$ to compute the Global-Topk probability in
$E_{i+1}$. If the current situation is of Case 1, then $E_i$ and $E_{i+1}$ only differ
in the event tuple generated by Rule 1. Recall that the only requirement on the scoring
function used in an induced event relation is to assign the lowest score to the event
tuple generated by Rule 1. This requirement is translated into the computation in the
DP table as associating the tuple generated by Rule 1 with the last column. Therefore,
we can take the first $l_i - 1$ columns from $DP_i$ and reuse them in $DP_{i+1}$. In
other words, by reusing the scoring function in $DP_i$ as much as possible based on
Fact 2, the resulting $DP_{i+1}$ table differs from $DP_i$ only in the last column. In
practice, $DP_{i+1}$ is computed incrementally by modifying the last column of $DP_i$
in place. Denote by $col_{cur}$ the current last column in $DP_i$. In $DP_{i+1}$,
$col_{cur}$ should be reassociated with the event tuple $t_{e_{t_{i+1}}}$, i.e.,

$col_{cur}.part\_id = id(t_{i+1})$,
$col_{cur}.prob = p(t_{i+1})$.

It is easy to see that the incremental computation cost is the cost of computing k + 1
entries in $col_{cur}$.

Similarly for Case 2, the first $l_i - 1$ columns in $DP_i$ can be reused. The two new
event tuples in $DP_{i+1}$ are $t_{e_{C_{id(t_i)}}}$ and $t_{e_{t_{i+1}}}$. To compute
$DP_{i+1}$, we need to change the association of two columns, $col_{cur}$ and
$col_{cur+1}$. The last column in $DP_i$ ($col_{cur}$) is reassociated with
$t_{e_{C_{id(t_i)}}}$:

$col_{cur}.part\_id = id(t_i)$,
$col_{cur}.prob = \sum_{t_{j''} \in C_{id(t_i)},\ 1 \le j'' \le i} p(t_{j''})$.

The last column in $DP_{i+1}$ ($col_{cur+1}$) is associated with $t_{e_{t_{i+1}}}$:

$col_{cur+1}.part\_id = id(t_{i+1})$,
$col_{cur+1}.prob = p(t_{i+1})$.
Example 7. Consider the following data (*), and a top-2 query.

           Part    Score   Prob.
    t_1    C_1     0.9     0.3
    t_2    C_2     0.8     0.1
    t_3    C_3     0.7     0.2
    t_4    C_1     0.6     0.4
    t_5    C_2     0.5     0.7

Tuples are processed in the decreasing order of their scores, i.e., $t_1, t_2, \ldots, t_5$.
Figure 1 illustrates each $DP_i$ table after the processing of tuple $t_i$. The
annotation (part_id, prob) of each column is also illustrated. The entry in bold is the
Global-Topk probability of the corresponding tuple inducing the event relation.



(a) DP_1

    k    col_1 (1, 0.3)
    1    0.3
    2    0.3*

(b) DP_2

    k    col_1 (1, 0.3)    col_2 (2, 0.1)
    1    0.3               0.07
    2    0.3               0.1*

(c) DP_3

    k    col_1 (1, 0.3)    col_2 (2, 0.1)    col_3 (3, 0.2)
    1    0.3               0.07              0.126
    2    0.3               0.1               0.194*

(d) DP_4

    k    col_1 (2, 0.1)    col_2 (3, 0.2)    col_3 (1, 0.4)
    1    0.1               0.18              0.288
    2    0.1               0.2               0.392*

(e) DP_5

    k    col_1 (3, 0.2)    col_2 (1, 0.7)    col_3 (2, 0.7)
    1    0.2               0.56              0.168
    2    0.2               0.7               0.602*

Fig. 1. DP table evolution in Rollback (the bold entry of each table is marked with *)



Take the processing of $t_3$ for example. Since $t_3$ is independent of $t_2$ and
$t_1$, this is Case 2. Therefore, the last column in $DP_2$ ($col_2$) needs to be
reassociated with $t_{e_{C_{id(t_2)}}} = t_{e_{C_2}}$ in $E_3$. In $DP_3$,

$col_2.part\_id = id(t_2) = 2$,
$col_2.prob = \sum_{t_{j''} \in C_{id(t_2)},\ 1 \le j'' \le 2} p(t_{j''}) = p(t_2) = 0.1$.

The last column in $DP_3$ ($col_3$) is associated with the event tuple $t_{e_{t_3}}$
generated by Rule 1 in $E_3$:

$col_3.part\_id = id(t_3) = 3$,
$col_3.prob = p(t_3) = 0.2$.

Compared to $DP_2$, the first column with an annotation change in $DP_3$ is $col_3$.
The DP table needs to be recomputed from $col_3$ (inclusive) upwards. In this case, it
is only $col_3$. Notice that, even though the annotation of $col_2$ does not change
from $DP_2$ to $DP_3$, its meaning changes. In $DP_2$, $col_2$ is associated with
$t_{e_{t_2}}$ in $E_2$ instead.

(*) We explicitly include partition information in the representation, and thus the
horizontal lines do not represent partition here.

In Case 1 and Case 2, the event tuple which we want to "erase" from $E_i$, i.e.,
$t_{e_{t_i}}$, is associated with the last column in $DP_i$. In Case 3, by Equation (6),
we want to "erase" from $E_i$ the event tuple $t_{e_{C_{id(t_{i+1})}}}$ in addition to
$t_{e_{t_i}}$. Assume $t_{e_{C_{id(t_{i+1})}}}$ is associated with $col_j$ in $DP_i$,
and the columns in $DP_i$ are

$col_1, \ldots, col_{j-1}, col_j, col_{j+1}, \ldots, col_{cur-1}, col_{cur}$

which correspond to

$t_{e_{C_{i_1}}}, \ldots, t_{e_{C_{i_{j-1}}}}, t_{e_{C_{i_j}}}, t_{e_{C_{i_{j+1}}}}, \ldots, t_{e_{C_{i_{cur-1}}}}, t_{e_{t_i}}$

in $E_i$ respectively. Obviously, $i_j = id(t_{i+1})$. By Equation (6),

$E_{i+1} = \{t_{e_{C_{i_1}}}, \ldots, t_{e_{C_{i_{j-1}}}}, t_{e_{C_{i_{j+1}}}}, \ldots, t_{e_{C_{i_{cur-1}}}}, t_{e_{C_{id(t_i)}}}, t_{e_{t_{i+1}}}\}.$

By Fact 2, as long as $t_{e_{t_{i+1}}}$ is associated with the last column in
$DP_{i+1}$, the column association order of other tuples in $E_{i+1}$ does not matter in
computing the Global-Topk probability of $t_{i+1}$. By adopting a column association
order such that

$t_{e_{C_{i_1}}}, \ldots, t_{e_{C_{i_{j-1}}}}$

is associated with

$col_1, \ldots, col_{j-1}$

respectively in $DP_{i+1}$, we can reuse the first j - 1 columns already computed in
$DP_i$. In our DP computation, the values in a column depend on the values in its
previous column. Once we change the values in $col_j$, every $col_{j'}$, $j' > j$,
needs to be recomputed regardless. Therefore, the recomputation cost is the same for
any column association order of event tuples

$t_{e_{C_{i_{j+1}}}}, \ldots, t_{e_{C_{i_{cur-1}}}}, t_{e_{C_{id(t_i)}}}.$

In Rollback, we simply use this order above as the column association order. In fact,
the name of this optimization, Rollback, refers to the fact that we are "rolling back"
the computation in the DP table until we hit $col_j$ and recompute all the columns with
an index equal to or higher than j.

Example 8. Continuing Example 7, consider the processing of $t_5$. $t_5$ is independent
of $t_4$, while $t_5$ and $t_2$ are exclusive. Therefore, this is Case 3. We first
locate $col_j$ associated with $t_{e_{C_{id(t_5)}}} = t_{e_{C_2}}$ in $DP_4$. In this
case, it is $col_1$. Then, we roll all the way back to $col_1$ in $DP_4$, erasing every
column on the way including $col_1$. As $col_j = col_1$, there is no column from $DP_4$
that we can reuse in $DP_5$. We move on to recompute $col_{j'}$, $j \le j'$, in $DP_5$
that are associated with $t_{e_{C_3}}$ and $t_{e_{C_{id(t_4)}}} = t_{e_{C_1}}$. In
particular, $col_2$ in $DP_5$ is associated with $t_{e_{C_1}}$. Thus,

$col_2.part\_id = 1$,
$col_2.prob = \sum_{t_{j''} \in C_1,\ 1 \le j'' \le 4} p(t_{j''}) = 0.3 + 0.4 = 0.7$.

The last column in $DP_5$ is again associated with the event tuple $t_{e_{t_5}}$
generated by Rule 1 in $E_5$.

Out of the five tuples, the processing of $t_1$, $t_2$, $t_3$ is of Case 2, and the
processing of $t_4$, $t_5$ is of Case 3. Whenever we compute/recompute the DP table,
the event tuples associated with the columns are from the induced event relation, and
therefore independent. Thus, every DP table computation progresses in the same fashion
as that with the DP table in Example 5.

Finally, we keep the Global-Top2 probability of each tuple (from the original
probabilistic relation) in a priority queue. When we finish processing all the tuples,
we get the top-2 winners. In this example, the priority queue is updated every time we
get an entry in bold. The winners are $t_5$ and $t_4$ with the Global-Top2 probability
0.602 and 0.392 respectively.
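
The column bookkeeping of Rollback can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: the name `rollback_topk` and the tuple layout are assumptions, and each column stores a prefix vector (the probability that at most r - 1 of the preceding columns occur) from which the DP entries are derived, an equivalent reformulation chosen here to avoid divisions.

```python
def rollback_topk(tuples, k):
    """Rollback sketch; returns (probs, tables).

    tuples: list of (tuple id, part id, score, probability), distinct scores.
    probs:  dict tuple id -> Global-Topk probability.
    tables: dict tuple id -> list of (part id, prob, [q(1), ..., q(k)]) columns.
    """
    order = sorted(tuples, key=lambda t: -t[2])
    cols = []                 # (part id, prob) annotations; last is the Rule 1 tuple
    # atmost[c][r] = Pr(at most r - 1 of the first c columns occur), r = 0..k
    atmost = [[0.0] + [1.0] * k]
    part_sum = {}             # accumulated probability of seen tuples per part
    prev_part = None
    probs, tables = {}, {}
    for t_id, part, _score, p in order:
        if prev_part is None:
            start = 0
        elif part == prev_part:                   # Case 1: swap the Rule 1 tuple
            cols.pop()
            start = len(cols)
        else:
            # The old Rule 1 column now stands for the whole part of t_i.
            cols[-1] = (prev_part, part_sum[prev_part])
            start = len(cols) - 1                 # Case 2: recompute from col_cur
            parts = [c_part for c_part, _ in cols[:-1]]
            if part in parts:                     # Case 3: erase col_j
                start = parts.index(part)
                del cols[start]
        cols.append((part, p))
        part_sum[part] = part_sum.get(part, 0.0) + p
        prev_part = part
        del atmost[start + 1:]                    # roll back to col_start
        for _, c_prob in cols[start:]:            # recompute col_start onwards
            prev = atmost[-1]
            atmost.append([0.0] + [prev[r] * (1 - c_prob) + prev[r - 1] * c_prob
                                   for r in range(1, k + 1)])
        probs[t_id] = p * atmost[-2][k]
        tables[t_id] = [(cp, pr, [pr * atmost[i][r] for r in range(1, k + 1)])
                        for i, (cp, pr) in enumerate(cols)]
    return probs, tables
```

Run on the data of Example 7 with k = 2, the column annotations after processing $t_4$ and $t_5$ should match parts (d) and (e) of Figure 1.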

RollbackSort For the rollback operation in Case 3 of Rollback, define its depth as the
number of columns recomputed in rolling back excluding the last column. For example,
when processing $t_5$ in Example 8, $col_1$, $col_2$ and $col_3$ are recomputed in
$DP_5$. Therefore the depth of this rollback operation is 3 - 1 = 2.

Recall that in Case 3 of Rollback, we adopt an arbitrary order

$t_{e_{C_{i_{j+1}}}}, \ldots, t_{e_{C_{i_{cur-1}}}}, t_{e_{C_{id(t_i)}}}$

to process those event tuples in $DP_{i+1}$. The Global-Topk computation in $E_{i+1}$
does not stipulate any particular order over those tuples. Any permutation of this order
is equally valid. The intuition behind RollbackSort is that we will be able to find a
permutation that will reduce the depth of future rollback operations (if any), given
additional statistics on the probabilistic relation $R^p$, namely the count of the
tuples in each part of the partition. Theoretically, it requires an extra pass over the
relation to compute the statistics. In practice, however, this extra pass is often not
needed because these statistics can be precomputed and stored.

In RollbackSort, if the current situation is Case 3, we do a stable sort on

$t_{e_{C_{i_{j+1}}}}, \ldots, t_{e_{C_{i_{cur-1}}}}, t_{e_{C_{id(t_i)}}}$

in the non-decreasing order of the number of unseen tuples in its corresponding part,
and then use the resulting order to process those event tuples. The intuition is that
each unseen tuple has the potential to trigger a rollback operation. By pushing the
event tuple with the most unseen tuples close to the end of the current DP table, we
could reduce the depth of future rollback operations. In order to facilitate this
sorting, we add one more component unseen to the annotation of each column.

Example 9. We redo the problem in Example 7 and Example 8 using RollbackSort.
Now, the annotation of each column becomes (part_id, prob, unseen). The evolution
of the DP table is shown in Figure 2. In RollbackSort, the statistics on all parts are
available: 2 tuples in $C_1$, 2 tuples in $C_2$ and 1 tuple in $C_3$.

Fig. 2. DP table evolution in RollbackSort



Consider the processing of $t_1$ in $DP_1$. As we have just seen one tuple $t_1$ from
$C_1$, there is one more unseen tuple from $C_1$ coming in the future. Therefore,
$col_1.unseen = 1$. All the other unseen annotations are computed in the same way.

When processing $t_4$ (Case 3), the column associated with
$t_{e_{C_{id(t_4)}}} = t_{e_{C_1}}$ is $col_1$. We roll back to $col_1$ as before and
recompute all the columns upwards in $DP_4$. Notice that the recomputation is performed
in the order $t_{e_{C_3}}$, $t_{e_{C_2}}$, in contrast to the order $t_{e_{C_2}}$,
$t_{e_{C_3}}$ used in Example 8 (Figure 1(d)). $C_2$ has one more unseen tuple which can
trigger the rollback operation while there are no more unseen tuples from $C_3$. The
benefit of this order becomes clear when we process $t_5$. We only need to roll back to
$col_2$ in $DP_5$. The depth of this rollback operation is 1. Recall that the depth of
the same rollback operation is 2 in Example 8. In other words, we save the computation
of 1 column by applying RollbackSort.
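
The effect of the sort on rollback depth can be checked with a small Python simulation that tracks only the order of the columns, not the DP values. The function `rollback_depths` and its input layout are hypothetical names introduced for this sketch.

```python
def rollback_depths(tuples, use_sort):
    """Depth of each Case 3 rollback, with or without RollbackSort (sketch).

    tuples: list of (tuple id, part id, score); scores are distinct.
    Returns dict tuple id -> depth (columns recomputed, last column excluded).
    """
    order = sorted(tuples, key=lambda t: -t[2])
    unseen = {}                                   # part id -> tuples not yet seen
    for _, part, _ in order:
        unseen[part] = unseen.get(part, 0) + 1
    cols = []                                     # part ids; last is the Rule 1 column
    prev_part = None
    depths = {}
    for t_id, part, _ in order:
        unseen[part] -= 1
        if prev_part is None:
            cols.append(part)
        elif part != prev_part:
            if part in cols:                      # Case 3: erase col_j
                j = cols.index(part)
                del cols[j]
                if use_sort:
                    # Stable sort of the columns that must be recomputed anyway.
                    cols[j:] = sorted(cols[j:], key=lambda c: unseen[c])
                depths[t_id] = len(cols) - j
            cols.append(part)                     # new Rule 1 column (Cases 2, 3)
        # Case 1 (same part as the previous tuple) keeps the column layout.
        prev_part = part
    return depths
```

On the data of Example 7, the simulation should report depth 2 for both Case 3 rollbacks without sorting, and depth 1 for $t_5$ once the sort is enabled.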

Rollback and RollbackSort significantly improve the performance in practice, as we
will see in Section 6. The price we pay for this speedup is an increase in the space
usage. The space complexity is $O(kn)$ for both optimizations. The quadratic theoretical
bound on running time remains unchanged.

5 Global-Topk under General Scoring Functions

5.1 Semantics and Postulates

Global-Topk Semantics with Allocation Policy Under a general scoring function,
the Global-Topk semantics remains the same. However, the definition of Global-Topk
probability in Definition 5 needs to be generalized to handle ties.

Recall that under an injective scoring function s, there is a unique top-k answer set
S in every possible world W. When the scoring function s is non-injective, there may be
multiple top-k answer sets $S_1, \ldots, S_d$, each of which is returned
nondeterministically. Therefore, for any tuple $t \in \bigcap S_i$, $i = 1, \ldots, d$,
the world W contributes Pr(W) to the Global-Topk probability of t. On the other hand,
for any tuple $t \in (\bigcup S_i - \bigcap S_i)$, $i = 1, \ldots, d$, the world W
contributes only a fraction of Pr(W) to the Global-Topk probability of t. The
allocation policy determines the value of this fraction, i.e., the allocation
coefficient. Denote by $\alpha(t, W)$ the allocation coefficient of a tuple t in a
world W. Let $all_{k,s}(W) = \bigcup S_i$, $i = 1, \ldots, d$.

Definition 8 (Global-Topk Probability under a General Scoring Function). Assume
a probabilistic relation $R^p = (R, p, C)$, a non-negative integer k and a scoring
function s over $R^p$. For any tuple t in R, the Global-Topk probability of t, denoted
by $P^{R^p}_{k,s}(t)$, is the sum of the (partial) probabilities of all possible worlds
of $R^p$ whose top-k answer set may contain t.

$P^{R^p}_{k,s}(t) = \sum_{W \in pwd(R^p),\ t \in all_{k,s}(W)} \alpha(t, W) \, Pr(W).$ (7)

With no prior bias towards any tuple, it is natural to assume that each of
$S_1, \ldots, S_d$ is returned nondeterministically with equal probability. Notice that
this probability has nothing to do with tuple probabilities. Rather, it is determined
by the number of equally qualified top-k answer sets. Hence, we have the following
Equal allocation policy.

Definition 9 (Equal Allocation Policy). Assume a probabilistic relation
$R^p = (R, p, C)$, a non-negative integer k and a scoring function s over $R^p$. For a
possible world $W \in pwd(R^p)$ and a tuple $t \in W$, let
$a = |\{t' \in W \mid t' \succ_s t\}|$ and $b = |\{t' \in W \mid t' \sim_s t\}|$. Then

$\alpha(t, W) = \begin{cases} 1 & \text{if } a < k \text{ and } a + b \le k \\ \frac{k - a}{b} & \text{if } a < k \text{ and } a + b > k \end{cases}$

This notion of Equal allocation policy is in the spirit of uniform allocation policy
introduced in [26] to handle imprecision in OLAP, although the specified goals are
different. Note that [26] also introduces other allocation policies based on additional
information. In our application, it is also possible to design other allocation policies
given additional information.
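
For small simple relations, the Equal allocation policy and Equation (7) can be checked by brute force. The Python sketch below enumerates all possible worlds; the function names and the dictionary inputs are assumptions made here, and returning 0 when $a \ge k$ reflects the fact that such a tuple belongs to no top-k answer set of the world and so does not appear in the sum of Equation (7).

```python
from itertools import product

def allocation(world, t, k, score):
    """Equal allocation coefficient of tuple t in a possible world."""
    a = sum(1 for u in world if score[u] > score[t])     # strictly better than t
    b = sum(1 for u in world if score[u] == score[t])    # tied with t (t included)
    if a >= k:
        return 0.0                     # t is in no top-k answer set of this world
    return 1.0 if a + b <= k else (k - a) / b

def global_topk_bruteforce(prob, score, k):
    """Global-Topk probabilities in a simple relation by enumerating worlds.

    prob, score: dicts keyed by tuple id; tuples are independent.
    """
    ids = list(prob)
    result = {t: 0.0 for t in ids}
    for bits in product([False, True], repeat=len(ids)):
        world = [t for t, present in zip(ids, bits) if present]
        pr = 1.0
        for t, present in zip(ids, bits):
            pr *= prob[t] if present else 1 - prob[t]
        for t in world:
            result[t] += allocation(world, t, k, score) * pr
    return result
```

The enumeration is exponential in the number of tuples, so it is useful only as a reference for validating faster algorithms on toy inputs.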

Satisfaction of Postulates The semantic postulates in Section 3.1 are directly
applicable to Global-Topk with allocation policy. In Appendix A, we show that the Equal
allocation policy preserves the semantic postulates of Global-Topk.

5.2 Query Evaluation in Simple Probabilistic Relations 

Definition 10. Let $R^p = (R, p, C)$ be a probabilistic relation, k a non-negative
integer and s a general scoring function over $R^p$. Assume that
$R = \{t_1, t_2, \ldots, t_n\}$, $t_1 \succeq_s t_2 \succeq_s \cdots \succeq_s t_n$. Let
$T^{R^p}_{k,[i]}$, $k \le i$, be the sum of the probabilities of all possible worlds of
exactly k tuples from $\{t_1, \ldots, t_i\}$:

$T^{R^p}_{k,[i]} = \sum_{W \in pwd(R^p),\ |W \cap \{t_1, \ldots, t_i\}| = k} Pr(W).$

As usual, we omit the superscript in $T^{R^p}_{k,[i]}$, i.e., $T_{k,[i]}$, when the
context is unambiguous. Remark 1 shows that in a simple probabilistic relation
$T_{k,[i]}$ can be computed efficiently.

Remark 1. Let $R^p = (R, p, C)$ be a simple probabilistic relation, k a non-negative
integer and s a general scoring function over $R^p$. Assume that
$R = \{t_1, t_2, \ldots, t_n\}$, $t_1 \succeq_s t_2 \succeq_s \cdots \succeq_s t_n$. For
any i, $1 \le i \le n - 1$, $T_{k,[i]}$ can be computed using the DP table for computing
the Global-Topk probabilities in $R^p$ under an order-preserving injective scoring
function $s'$ such that $t_1 \succ_{s'} t_2 \succ_{s'} \cdots \succ_{s'} t_n$.

Proof. By case study,

- Case 1: If $k = 0$, $1 \le i \le n - 1$, then

  $T_{0,[i]} = \prod_{1 \le j \le i} (1 - p(t_j)) = \frac{P^{R^p}_{1,s'}(t_{i+1})}{p(t_{i+1})}.$ (8)

- Case 2: For every $1 \le k \le i \le n - 1$, by the definition of $T_{k,[i]}$, we have

  $T_{k,[i]} = \sum_{W \in pwd(R^p),\ |W \cap \{t_1, \ldots, t_i\}| \le k} Pr(W) - \sum_{W \in pwd(R^p),\ |W \cap \{t_1, \ldots, t_i\}| \le k - 1} Pr(W).$

  In the DP table computing the Global-Topk probabilities in $R^p$ under function $s'$,
  we have

  $P^{R^p}_{k+1,s'}(t_{i+1}) = \sum_{W \in pwd(R^p),\ t_{i+1} \in W,\ |W \cap \{t_1, \ldots, t_i\}| \le k} Pr(W)$   ($s'$ is injective)

  $= p(t_{i+1}) \sum_{W \in pwd(R^p),\ |W \cap \{t_1, \ldots, t_i\}| \le k} Pr(W)$   (tuples are independent)

  Therefore,

  $T_{k,[i]} = \frac{P^{R^p}_{k+1,s'}(t_{i+1})}{p(t_{i+1})} - \frac{P^{R^p}_{k,s'}(t_{i+1})}{p(t_{i+1})}.$ (9)

  Since $1 \le k \le i \le n - 1$, both $P^{R^p}_{k+1,s'}(t_{i+1})$ and
  $P^{R^p}_{k,s'}(t_{i+1})$ can be computed by the DP table used to compute the
  Global-Topk probabilities of tuples in $R^p$ under the injective scoring function s'.
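
The relation between $T_{k,[i]}$ and the Global-Topk probabilities of $t_{i+1}$ can be checked numerically. The Python sketch below is an illustration under assumptions: the helper names are hypothetical, tuples are given as a list of probabilities in decreasing score order under $s'$, and indices are 0-based, so `probs[i]` is $p(t_{i+1})$.

```python
def exact_counts(probs, i):
    """T[c] = probability that exactly c of the first i tuples appear."""
    T = [1.0]
    for p in probs[:i]:
        T = [(T[c] * (1 - p) if c < len(T) else 0.0) +
             (T[c - 1] * p if c > 0 else 0.0) for c in range(len(T) + 1)]
    return T

def topk_prob_injective(probs, k, i):
    """Global-Topk probability of the tuple at 0-based index i when tuples are
    independent and listed in strictly decreasing score order."""
    return probs[i] * sum(exact_counts(probs, i)[:k])

def t_from_topk(probs, k, i):
    """T_{k,[i]} recovered from two Global-Top probabilities of t_{i+1}."""
    p_next = probs[i]                   # p(t_{i+1}); assumed non-zero
    return (topk_prob_injective(probs, k + 1, i) -
            topk_prob_injective(probs, k, i)) / p_next
```

For k = 0 the second term vanishes, so the same function also covers Case 1 of the proof.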



Remark 2 shows that we can compute Global-Topk probability under a general
scoring function in polynomial time for an extreme case, where the probabilistic
relation is simple and all tuples tie in scores. As we will see shortly, this special
case plays an important role in our major result in Proposition 4.



Remark 2. Let $R^p = (R, p, C)$ be a simple probabilistic relation, k a non-negative
integer and s a general scoring function over $R^p$. Assume that
$R = \{t_1, \ldots, t_m\}$ and $t_1 \sim_s t_2 \sim_s \cdots \sim_s t_m$. For any tuple
$t_i$, $1 \le i \le m$, the Global-Topk probability of $t_i$, i.e.,
$P^{R^p}_{k,s}(t_i)$, can be computed using Remark 1.



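Proof. If $k \ge m$, it is trivial that $P^{R^p}_{k,s}(t_i) = p(t_i)$. Therefore, we only prove the case when $k < m$. According to Equation (7), for any $i$, $1 \le i \le m$,

\[
\begin{aligned}
P^{R^p}_{k,s}(t_i)
 &= \sum_{j=1}^{m} \sum_{\substack{W \in \mathit{pwd}(R^p) \\ t_i \in \mathit{all}_{k,s}(W),\ |W| = j}} \alpha(t_i, W)\Pr(W) \\
 &= \sum_{j=1}^{m} \sum_{\substack{W \in \mathit{pwd}(R^p) \\ t_i \in W,\ |W| = j}} \alpha(t_i, W)\Pr(W)
 \qquad \text{(since all tuples tie, $\mathit{all}_{k,s}(W) = W$)} \\
 &= \sum_{j=1}^{k} \sum_{\substack{W \in \mathit{pwd}(R^p) \\ t_i \in W,\ |W| = j}} \alpha(t_i, W)\Pr(W)
  + \sum_{j=k+1}^{m} \sum_{\substack{W \in \mathit{pwd}(R^p) \\ t_i \in W,\ |W| = j}} \alpha(t_i, W)\Pr(W) \\
 &= \sum_{j=1}^{k} \sum_{\substack{W \in \mathit{pwd}(R^p) \\ t_i \in W,\ |W| = j}} \Pr(W)
  + \sum_{j=k+1}^{m} \frac{k}{j} \sum_{\substack{W \in \mathit{pwd}(R^p) \\ t_i \in W,\ |W| = j}} \Pr(W)
\end{aligned}
\]

Without loss of generality, assume $i = m$; then the above equation becomes

\[
\begin{aligned}
P^{R^p}_{k,s}(t_m)
 &= \sum_{j=1}^{k} \sum_{\substack{W \in \mathit{pwd}(R^p) \\ t_m \in W,\ |W| = j}} \Pr(W)
  + \sum_{j=k+1}^{m} \frac{k}{j} \sum_{\substack{W \in \mathit{pwd}(R^p) \\ t_m \in W,\ |W| = j}} \Pr(W) \\
 &= p(t_m) \Bigl( \sum_{j=1}^{k} T^{R^p}_{j-1,[m-1]} + \sum_{j=k+1}^{m} \frac{k}{j}\, T^{R^p}_{j-1,[m-1]} \Bigr)
\end{aligned}
\tag{10}
\]

By Remark 1, every $T^{R^p}_{j-1,[m-1]}$, $0 \le j-1 \le m-1$, can be computed by the DP table computing Global-Topk probabilities in $R^p$ under an order-preserving injective scoring function $s'$, and Equation (8) or (9). Therefore, Equation (10) can be computed using Remark 1.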
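
As an illustration of Equation (10), the sketch below computes the all-tie Global-Topk probability of one tuple. It is written under assumptions: the function names are hypothetical, and the values $T_{j-1,[m-1]}$ are obtained with a direct counting DP over the other $m-1$ tuples instead of being read off the DP table of Algorithm 2 through Equation (9).

```python
def count_dist(probs):
    """dist[c] = probability that exactly c of the given independent
    tuples exist (O(m^2) dynamic program)."""
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for c, mass in enumerate(dist):
            nxt[c] += mass * (1.0 - p)      # tuple absent
            nxt[c + 1] += mass * p          # tuple present
        dist = nxt
    return dist

def all_tie_topk_prob(probs, k, target):
    """Equation (10): Global-Topk probability of probs[target] when all
    m tuples tie in score, under the Equal allocation policy."""
    others = probs[:target] + probs[target + 1:]
    T = count_dist(others)                  # T[j-1] plays the role of T_{j-1,[m-1]}
    m = len(probs)
    total = 0.0
    for j in range(1, m + 1):               # j = size of the possible world
        coeff = 1.0 if j <= k else k / j    # allocation coefficient
        total += coeff * T[j - 1]
    return probs[target] * total
```

For two tying tuples with probabilities 0.4 and 0.6 and $k = 1$, the first tuple gets a full share when it is alone and half a share when both exist, i.e. $0.4 \cdot (0.4 + 0.6/2)$.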

Based on Remark 1 and Remark 2, we design Algorithm 5 and prove its correctness in Theorem 4 using Proposition 4.

Assume $R^p = \langle R, p, \mathcal{C}\rangle$ where $R = \{t_1, t_2, \ldots, t_n\}$ and $t_1 \succeq_s t_2 \succeq_s \cdots \succeq_s t_n$. For any $t_l \in R$, $i_l$ is the largest index such that $t_{i_l} \succ_s t_l$, and $j_l$ is the largest index such that $t_{j_l} \succeq_s t_l$.

Intuitively, Algorithm 5 and Proposition 4 convey the idea that, in a simple probabilistic relation, the computation of Global-Topk under the Equal allocation policy can be simulated by the following procedure:

(S1) Independently flip a biased coin with probability $p(t_j)$ for each tuple $t_j \in R = \{t_1, t_2, \ldots, t_n\}$, which gives us a possible world $W \in \mathit{pwd}(R^p)$;

(S2) Return a top-k answer set $S$ of $W$ nondeterministically (with equal probability in the presence of multiple top-k sets). The Global-Topk probability of $t_l$ is the probability that $t_l \in S$.
The above Step (S1) can be further refined into:

(S1.1) Independently flip a biased coin with probability $p(t_j)$ for each tuple $t_j \in R_A = \{t_1, t_2, \ldots, t_{i_l}\}$, which gives us a collection of tuples $W_A$;

(S1.2) Independently flip a biased coin with probability $p(t_j)$ for each tuple $t_j \in R_B = \{t_{i_l+1}, \ldots, t_n\}$, which gives us a collection of tuples $W_B$. $W = W_A \cup W_B$ is a possible world from $\mathit{pwd}(R^p)$;

In order for $t_l$ to be in $S$, $W_A$ can have at most $k-1$ tuples. Let $|W_A| = k'$, then $k' < k$. Every top-k answer set of $W$ contains all $k'$ tuples from $W_A$, plus the top-$(k-k')$ tuples from $W_B$. For $t_l$ to be in $S$, it has to be in the top-$(k-k')$ set of $W_B$. Consequently, the probability of $t_l \in S$, i.e., the Global-Topk probability of $t_l$, is the joint probability that $|W_A| = k' < k$ and $t_l$ belongs to the top-$(k-k')$ set of $W_B$. The former is $T_{k',[i_l]}$ and the latter is $P^{R^p_B}_{k-k',s}(t_l)$, where $R^p_B$ is $R^p$ restricted to $R_B$. Again, due to the independence among tuples, Step (S1.1) and Step (S1.2) are independent, and their joint probability is simply the product of the two.
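
The simulation (S1)/(S2) can be turned into an exact reference computation for small relations. The sketch below is an assumption-laden illustration, not one of the paper's algorithms: `global_topk_equal` is a hypothetical name, and instead of sampling it enumerates every possible world and weights each tuple by the probability that a uniformly chosen top-k answer set contains it.

```python
from itertools import product

def global_topk_equal(scores, probs, k):
    """Exact Global-Topk probabilities in a simple relation under the
    Equal allocation policy, by enumerating all possible worlds."""
    n = len(scores)
    result = [0.0] * n
    for world in product([False, True], repeat=n):       # step (S1)
        pr = 1.0
        for present, p in zip(world, probs):
            pr *= p if present else 1.0 - p
        members = [i for i in range(n) if world[i]]
        for t in members:                                # step (S2)
            better = sum(1 for u in members if scores[u] > scores[t])
            ties = sum(1 for u in members if scores[u] == scores[t])
            if better >= k:
                continue          # t is in no top-k answer set of this world
            # full share if the whole tie group fits into the answer,
            # otherwise the remaining k - better slots are shared equally
            alpha = min(1.0, (k - better) / ties)
            result[t] += alpha * pr
    return result
```

For example, with scores `[3, 2, 2, 1]`, probabilities `[0.5, 0.4, 0.6, 0.7]` and `k = 2`, the two tying tuples only split a slot in worlds where the top-scored tuple is also present.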

Further notice that since $t_l$ has the highest score in $R_B$ and all tuples are independent in $R_B$, any tuple with a score lower than that of $t_l$ does not have an influence on $P^{R^p_B}_{k-k',s}(t_l)$. In other words, $P^{R^p_B}_{k-k',s}(t_l) = P^{R^p_s(t_l)}_{k-k',s}(t_l)$, where $R^p_s(t_l)$ is $R^p$ restricted to all tuples tying with $t_l$ in $R$. Notice that the computation of $P^{R^p_s(t_l)}_{k-k',s}(t_l)$ is the extreme case addressed in Remark 2.

Algorithm 5 elaborates the algorithm based on the idea above, where $m = j_l - i_l$ is the number of tuples tying with $t_l$ (including $t_l$).

Furthermore, Algorithm 5 exploits the overlapping among DP tables and makes the following two optimizations:

1. Use a single DP table to collect the information needed to compute all $T_{k',[i_l]}$, $k' = 0, \ldots, k-1$, $l = 1, \ldots, n$ and $k' \le i_l$ (Line 2).

Notice that by definition, when $1 \le l \le n$, $i_l \le n-1$. It is easy to see that the DP table computing $T_{k-1,[n-1]}$ subsumes all other DP tables.

2. Use a single DP table to compute all $P^{R^p_s(t_l)}_{k-k',s}(t_l)$, $k' = 0, \ldots, k-1$, for a tuple $t_l$ (Lines 8-14).

Notice that in Equation (10), for different $k'$, the computation of $P^{R^p_s(t_l)}_{k-k',s}(t_l)$ requires the same set of values $T^{R^p_s(t_l)}_{j-1,[m-1]}$ (Lines 9-11). In Line 13, $P^{R^p_s(t_l)}_{k-k',s}(t_l)$ is abbreviated as $P_l(k'')$, where $k'' = k - k'$, to emphasize the changing parameter $k'$.



Each DP table computation uses a call to Algorithm 2 (Line 2 in Algorithm 5, Line 
3 in Algorithm 6). 
Algorithm 5 (Ind_Topk_Gen) Evaluate Global-Topk Queries in a Simple Probabilistic Relation under a General Scoring Function

Require: $R^p = \langle R, p, \mathcal{C}\rangle$, $k$
Ensure: tuples in $R$ are sorted in the non-increasing order based on the scoring function $s$
 1: Initialize a fixed cardinality $(k+1)$ priority queue $Ans$ of $\langle t, prob\rangle$ pairs, which compares pairs on $prob$, i.e., the Global-Topk probability of $t$;
 2: Get the DP table for computing $T_{k',[i]}$, $k' = 0, \ldots, k-1$, $i = 1, \ldots, n-1$, $k' \le i$ using Algorithm 2, i.e.,
      $q(0 \ldots k, 1 \ldots |R|)$ = Ind_Topk_Sub($R^p$, $k$);
 3: for $l = 1$ to $|R|$ do
 4:   $m = j_l - i_l$;
 5:   if $m == 1$ then
 6:     Add $\langle t_l, q(k, l)\rangle$ to $Ans$;
 7:   else
 8:     Get the DP table for computing $P^{R^p_s(t_l)}_{k-k',s}(t_l)$, i.e., $P_l(k-k')$, $k' = 0, \ldots, k-1$:
          $q_{tie}(0 \ldots m, 1 \ldots m)$ = Ind_Topk_Gen_Sub($R^p_s(t_l)$, $t_l$, $m$);
 9:     for $k'' = 0$ to $m-1$ do
10:       $T^{R^p_s(t_l)}_{k'',[m-1]} = \dfrac{q_{tie}(k''+1, m) - q_{tie}(k'', m)}{p(t_l)}$;
11:     end for
12:     for $k'' = 1$ to $k$ do
13:       $P_l(k'') = p(t_l)\Bigl(\sum_{j=1}^{k''} T^{R^p_s(t_l)}_{j-1,[m-1]} + \sum_{j=k''+1}^{m} \frac{k''}{j}\, T^{R^p_s(t_l)}_{j-1,[m-1]}\Bigr)$;
14:     end for
15:     $P^{R^p}_{k,s}(t_l) = 0$;
16:     for $k' = 0$ to $k-1$ do
17:       $T_{k',[i_l]} = \dfrac{q(k'+1, i_l+1) - q(k', i_l+1)}{p(t_{i_l+1})}$;
18:       $P^{R^p}_{k,s}(t_l) = P^{R^p}_{k,s}(t_l) + T_{k',[i_l]} \cdot P_l(k-k')$;
19:     end for
20:     Add $\langle t_l, P^{R^p}_{k,s}(t_l)\rangle$ to $Ans$;
21:   end if
22:   if $|Ans| > k$ then
23:     remove the pair with the smallest $prob$ value from $Ans$;
24:   end if
25: end for
26: return $\{t_i \mid \langle t_i, prob\rangle \in Ans\}$;
Algorithm 6 (Ind_Topk_Gen_Sub) Compute the DP table for Global-Topk probabilities in a Simple Probabilistic Relation under an All-Tie Scoring Function

Require: $R^p_s(t_{target}) = \langle R, p, \mathcal{C}\rangle$, $t_{target}$, $m$
Ensure: $|R| = m$, $t_{target} \in R$
1: Rearrange tuples in $R$ such that $R = \{t_1, \ldots, t_{m-1}, t_m\}$ and $t_m = t_{target}$;
2: Assume the injective scoring function $s'$ is such that $t_1 \succ_{s'} \cdots \succ_{s'} t_{m-1} \succ_{s'} t_{target}$;
3: Get the DP table
     $q_{tie}(0 \ldots m, 1 \ldots m)$ = Ind_Topk_Sub($R^p_s(t_{target})$, $m$);
4: return $q_{tie}(0 \ldots m, 1 \ldots m)$;



Proposition 4. Let $R^p = \langle R, p, \mathcal{C}\rangle$ be a simple probabilistic relation where $R = \{t_1, \ldots, t_n\}$, $t_1 \succeq_s t_2 \succeq_s \cdots \succeq_s t_n$, $k$ a non-negative integer and $s$ a scoring function. For every $t_l \in R$, the Global-Topk probability of $t_l$ can be computed by the following equation:

\[
P^{R^p}_{k,s}(t_l) = \sum_{k'=0}^{k-1} T_{k',[i_l]} \cdot P^{R^p_s(t_l)}_{k-k',s}(t_l) \tag{11}
\]

where $R^p_s(t_l)$ is $R^p$ restricted to $\{t \in R \mid t \sim_s t_l\}$.
Proof. See Appendix B.
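
Equation (11) can be sketched directly in code. The version below is an illustration under assumptions rather than Algorithm 5 itself: the names `count_dist` and `global_topk_by_eq11` are hypothetical, and both factors are computed from a plain counting DP over the strictly better tuples and over the tying tuples, without the shared DP tables that Algorithm 5 uses to reach its stated complexity.

```python
def count_dist(probs):
    """dist[c] = probability that exactly c of the given independent
    tuples exist."""
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for c, mass in enumerate(dist):
            nxt[c] += mass * (1.0 - p)
            nxt[c + 1] += mass * p
        dist = nxt
    return dist

def global_topk_by_eq11(scores, probs, k):
    """Global-Topk probability of every tuple of a simple relation under
    a general scoring function and the Equal allocation policy."""
    n = len(scores)
    out = []
    for l in range(n):
        higher = [probs[u] for u in range(n) if scores[u] > scores[l]]
        tied = [probs[u] for u in range(n)
                if u != l and scores[u] == scores[l]]
        T_high = count_dist(higher)      # T_{k',[i_l]}
        T_tie = count_dist(tied)         # T_{j-1,[m-1]} in the tie group
        m = len(tied) + 1
        total = 0.0
        # k' = number of strictly better tuples that exist, k' <= k - 1
        for kp in range(0, min(k, len(higher) + 1)):
            kk = k - kp                  # slots left for the tie group
            # Equation (10) applied to the tie group with k'' = k - k'
            tie_prob = probs[l] * sum(
                (1.0 if j <= kk else kk / j) * T_tie[j - 1]
                for j in range(1, m + 1))
            total += T_high[kp] * tie_prob
        out.append(total)
    return out
```

With scores `[3, 2, 2, 1]`, probabilities `[0.5, 0.4, 0.6, 0.7]` and `k = 2`, the second tuple obtains $0.5 \cdot 0.4 + 0.5 \cdot 0.28 = 0.34$: the first term covers worlds without the top-scored tuple, the second term worlds with it.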

Theorem 4 (Correctness of Algorithm 5). Given a probabilistic relation $R^p = \langle R, p, \mathcal{C}\rangle$, a non-negative integer $k$ and a general scoring function $s$, Algorithm 5 correctly computes a Global-Topk answer set of $R^p$ under the scoring function $s$.

Proof. In Algorithm 5, by Remark 1, Line 2 and Line 17 correctly compute $T_{k',[i]}$ for $0 \le k' \le k-1$, $1 \le i \le n-1$, $k' \le i$. The entries in Line 8 serve to compute Line 10 by Equation (9). Recall that $R^p_s(t_l)$ is $R^p$ restricted to all tuples tying with $t_l$, which is the extreme case addressed in Remark 2. By Remark 2, Line 8 collects the information to compute $P^{R^p_s(t_l)}_{k-k',s}(t_l)$, i.e., $P_l(k'')$, $1 \le k'' = k-k' \le k$. Lines 12-14 correctly compute those values by Equation (10). Here, any non-existing $T^{R^p_s(t_l)}_{j-1,[m-1]}$, i.e., $j-1 \notin [0, m-1]$, is assumed to be zero. By Proposition 4, Lines 15-19 correctly compute the Global-Topk probability of $t_l$. Also notice that in Line 6, the Global-Topk probability of a tuple without tying tuples is retrieved directly. It is an optimization, as the code handling the general case (i.e., $m > 1$, Lines 8-20) works for this special case as well. Again, the top-level structure with the priority queue in Algorithm 5 ensures that a Global-Topk answer set is correctly computed.

In Algorithm 5, Line 2 takes $O(kn)$, and for each tuple, there is one call to Algorithm 6 in Line 8, which takes $O(m_{\max}^2)$, where $m_{\max}$ is the maximal number of tying tuples. Lines 9-11 take $O(m_{\max})$. Lines 12-14 take $O(k\, m_{\max})$. Therefore, Algorithm 5 takes $O(n \max(k, m_{\max})^2)$ altogether.

As before, the major space use is the computation of the two DP tables in Line 2 and Line 8. A straightforward implementation leads to $O(kn)$ and $O(m_{\max}^2)$ space respectively. Therefore, the total space is $O(n \max(k, m_{\max}))$. Using a space optimization similar to that of Section 4.1, the space use for the two DP tables can be reduced to $O(k)$ and $O(m_{\max})$, respectively. Hence, the total space is $O(\max(k, m_{\max}))$.



5.3 Query Evaluation in General Probabilistic Relations 

Recall that under an injective scoring function, every tuple $t$ in a general probabilistic relation $R^p = \langle R, p, \mathcal{C}\rangle$ induces a simple event relation $E^p$, and we reduce the computation of $t$'s Global-Topk probability in $R^p$ to the computation of $t_{e_t}$'s Global-Topk probability in $E^p$.

In the case of general scoring functions, we use the same reduction idea. However, now for each part $C_i \in \mathcal{C}$, $C_i \ne C_{id(t)}$, tuple $t$ induces in $E^p$ two exclusive tuples $t_{e_{C_i},\succ}$ and $t_{e_{C_i},\sim}$, corresponding to the event $e_{C_i,\succ}$ that "there is a tuple from the part $C_i$ with a score higher than that of $t$" and the event $e_{C_i,\sim}$ that "there is a tuple from the part $C_i$ with a score equal to that of $t$", respectively. In addition, in Definition 11, we allow the existence of tuples with probability 0, in order to simplify the description of query evaluation algorithms. This is an artifact whose purpose will become clear in Theorem 5.

Definition 11 (Induced Event Relation under General Scoring Functions). Given a probabilistic relation $R^p = \langle R, p, \mathcal{C}\rangle$, a scoring function $s$ over $R^p$ and a tuple $t \in C_{id(t)} \in \mathcal{C}$, the event relation induced by $t$, denoted by $E^p = \langle E, p^E, \mathcal{C}^E\rangle$, is a probabilistic relation whose support relation $E$ has only one attribute, Event. The relation $E$ and the probability function $p^E$ are defined by the following four generation rules and the postprocess step:

- Rule 1.1: $t_{e_t,\sim} \in E$ and $p^E(t_{e_t,\sim}) = p(t)$;

- Rule 1.2: $t_{e_t,\succ} \in E$ and $p^E(t_{e_t,\succ}) = 0$;

- Rule 2.1: $\forall C_i \in \mathcal{C} \wedge C_i \ne C_{id(t)}.\ (t_{e_{C_i},\succ} \in E)$ and $p^E(t_{e_{C_i},\succ}) = \sum_{t' \in C_i,\ t' \succ_s t} p(t')$;

- Rule 2.2: $\forall C_i \in \mathcal{C} \wedge C_i \ne C_{id(t)}.\ (t_{e_{C_i},\sim} \in E)$ and $p^E(t_{e_{C_i},\sim}) = \sum_{t' \in C_i,\ t' \sim_s t} p(t')$.

Postprocess step: only when $p^E(t_{e_{C_i},\succ})$ and $p^E(t_{e_{C_i},\sim})$ are both 0, delete both $t_{e_{C_i},\succ}$ and $t_{e_{C_i},\sim}$.
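
The generation rules of Definition 11 can be sketched as a small function. This is a hedged illustration only: the function name, the dictionary-based input format and the `(label, p_better, p_tie)` output triples are assumptions made for the example, and the partition $\mathcal{C}^E$ is represented implicitly by keeping the two exclusive event tuples of each part in one triple.

```python
def induced_event_relation(parts, scores, probs, t):
    """Build the event relation induced by tuple t (Definition 11).

    parts  : list of lists of tuple ids (the partition C)
    scores : dict mapping tuple id -> score
    probs  : dict mapping tuple id -> probability
    Returns one (label, p_better, p_tie) triple per surviving part,
    with the pair generated by t itself placed last.
    """
    events = []
    for idx, part in enumerate(parts):
        if t in part:
            continue                      # covered by Rules 1.1 and 1.2
        # Rule 2.1: some tuple of this part scores higher than t
        p_better = sum(probs[u] for u in part if scores[u] > scores[t])
        # Rule 2.2: some tuple of this part ties with t
        p_tie = sum(probs[u] for u in part if scores[u] == scores[t])
        if p_better == 0 and p_tie == 0:
            continue                      # postprocess step
        events.append((f"C{idx + 1}", p_better, p_tie))
    # Rule 1.2 (probability 0) and Rule 1.1 (probability p(t))
    events.append(("t", 0.0, probs[t]))
    return events
```

A part whose tuples all score below $t$ produces two zero-probability event tuples and is therefore dropped by the postprocess step.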

Proposition 5. Given a probabilistic relation $R^p = \langle R, p, \mathcal{C}\rangle$ and a scoring function $s$, for any $t \in R^p$, the Global-Topk probability of $t$ equals the Global-Topk probability of $t_{e_t,\sim}$ when evaluating top-k in the induced event relation $E^p = \langle E, p^E, \mathcal{C}^E\rangle$ under the scoring function $s^E : E \to \mathbb{R}$, $s^E(t_{e_t,\succ}) = \frac{1}{2}$, $s^E(t_{e_t,\sim}) = \frac{1}{2}$, $s^E(t_{e_{C_i},\sim}) = \frac{1}{2}$
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Proof. See Appendix B. 

Notice that the induced event relation $E^p$ in Definition 11, unlike its counterpart under an injective scoring function, is not simple. Therefore, we cannot utilize the algorithm in Proposition 4. Rather, the induced relation $E^p$ is a special general probabilistic relation, where each part of the partition contains exactly two tuples. Recall that we allow tuples with probability 0 now. For this special general probabilistic relation, the recursion in Theorem 5 (Equations (12) and (13)) collects enough information to compute the Global-Topk probability of $t_{e_t,\sim}$ in $E^p$ (Equation (14)).

Definition 12 (Secondary Induced Event Relations). Let $E^p = \langle E, p^E, \mathcal{C}^E\rangle$ be the event relation induced by tuple $t$ under a general scoring function $s$. Without loss of generality, assume

\[
E = \{t_{e_{C_1},\succ}, t_{e_{C_1},\sim}, \ldots, t_{e_{C_{m-1}},\succ}, t_{e_{C_{m-1}},\sim}, t_{e_t,\succ}, t_{e_t,\sim}\}
\]

and we can split $E$ into two non-overlapping subsets $E_\succ$ and $E_\sim$ such that

\[
\begin{aligned}
E_\succ &= \{t_{e_{C_1},\succ}, \ldots, t_{e_{C_{m-1}},\succ}, t_{e_t,\succ}\}, \\
E_\sim &= \{t_{e_{C_1},\sim}, \ldots, t_{e_{C_{m-1}},\sim}, t_{e_t,\sim}\}.
\end{aligned}
\]

The two secondary induced event relations $E^p_\succ$ and $E^p_\sim$ are $E^p$ restricted to $E_\succ$ and $E_\sim$ respectively. They are both simple probabilistic relations which are mutually related. For every $1 \le i \le m-1$, the tuple $t_{i,\succ}$ ($t_{i,\sim}$ resp.) refers to $t_{e_{C_i},\succ}$ ($t_{e_{C_i},\sim}$ resp.). The tuple $t_{m,\succ}$ ($t_{m,\sim}$ resp.) refers to $t_{e_t,\succ}$ ($t_{e_t,\sim}$ resp.).

In spirit, the recursion in Theorem 5 is close to the recursion in Proposition 1, even though they are not computing the same measure. The following table compares the measure $q$ in Proposition 1 and the measure $u$ in Theorem 5:

Measure                | Conditions on the possible world $W$                                                      | $|\{t_j \mid t_j \in W,\ j \le i,\ t_j \sim_s t\}|$
$q(k,i)$               | (1) $W$ contains $t_i$; (2) $W$ has no more than $k$ tuples from $\{t_1, t_2, \ldots, t_i\}$ | -
$u_{\succ/\sim}(k,i,b)$ | (1) $W$ contains $t_i$; (2) $W$ has exactly $k$ tuples from $\{t_1, t_2, \ldots, t_i\}$     | $b$


Under the general scoring function $s^E$, a possible world of an induced relation $E^p$ may partially contribute to the tuple $t_{m,\sim}$'s Global-Topk probability. The allocation coefficient depends on the combination of two factors: the number of tuples that are strictly better than $t_{m,\sim}$ and the number of tuples tying with $t_{m,\sim}$. Therefore, in the new measure $u$, first, we add one more dimension to keep track of $b$, i.e., the number of tying tuples with a subscript no more than $i$ in a world. Second, we keep track of distinct $(k, b)$ pairs. Furthermore, the recursion on the measure $u$ differentiates between two cases: a non-tying tuple (handled by $u_\succ$) and a tying tuple (handled by $u_\sim$), since those two types of tuples have different influences on the values of $k$ and $b$.

Formally, let $u_\succ(k', i, b)$ ($u_\sim(k', i, b)$ resp.) be the sum of the probabilities of all the possible worlds $W$ of $E^p$ such that

1. $t_{i,\succ} \in W$ ($t_{i,\sim} \in W$ resp.)

2. $i$ is the $k'$th smallest tuple subscript in world $W$

3. the world $W$ contains $b$ tuples from $E_\sim$ with subscript less than or equal to $i$.

Equations (12) and (13) resemble Equation (3), except that now, since we introduce tuples with probability 0 to ensure that each part of $\mathcal{C}^E$ has exactly two tuples, we need to address the special cases when a divisor can be zero. Notice that, for any $i$, $1 \le i \le m$, at least one of $p^E(t_{i,\succ})$ and $p^E(t_{i,\sim})$ is non-zero; otherwise, they are not in $E^p$ by definition.

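Theorem 5. Given a probabilistic relation $R^p = \langle R, p, \mathcal{C}\rangle$, a scoring function $s$, $t \in R^p$, and its induced event relation $E^p = \langle E, p^E, \mathcal{C}^E\rangle$, where $|E| = 2m$, the following recursion on $u_\succ(k', i, b)$ and $u_\sim(k', i, b)$ holds, where $b_{\max}$ is the number of tuples with a positive probability in $E^p_\sim$.

When $i = 1$, $0 \le k' \le m$ and $0 \le b \le b_{\max}$,

\[
u_\succ(k', 1, b) = \begin{cases} p^E(t_{1,\succ}) & k' = 1,\ b = 0 \\ 0 & \text{otherwise} \end{cases}
\qquad
u_\sim(k', 1, b) = \begin{cases} p^E(t_{1,\sim}) & k' = 1,\ b = 1 \\ 0 & \text{otherwise} \end{cases}
\]

For every $i$, $2 \le i \le m$, $0 \le k' \le m$ and $0 \le b \le b_{\max}$, write $r_{i-1} = 1 - p^E(t_{i-1,\succ}) - p^E(t_{i-1,\sim})$. Then

\[
u_\succ(k', i, b) =
\begin{cases}
0 & k' = 0 \\[4pt]
\Bigl( u_\succ(k', i-1, b)\,\dfrac{r_{i-1}}{p^E(t_{i-1,\succ})} + u_\succ(k'-1, i-1, b) + u_\sim(k'-1, i-1, b) \Bigr)\, p^E(t_{i,\succ})
 & 1 \le k' \le m,\ p^E(t_{i-1,\succ}) > 0 \\[4pt]
\Bigl( u_\sim(k', i-1, b+1)\,\dfrac{r_{i-1}}{p^E(t_{i-1,\sim})} + u_\succ(k'-1, i-1, b) + u_\sim(k'-1, i-1, b) \Bigr)\, p^E(t_{i,\succ})
 & 1 \le k' \le m,\ p^E(t_{i-1,\succ}) = 0 \text{ and } 0 \le b < b_{\max} \\[4pt]
\bigl( u_\succ(k'-1, i-1, b) + u_\sim(k'-1, i-1, b) \bigr)\, p^E(t_{i,\succ})
 & 1 \le k' \le m,\ p^E(t_{i-1,\succ}) = 0 \text{ and } b = b_{\max}
\end{cases}
\tag{12}
\]

\[
u_\sim(k', i, b) =
\begin{cases}
0 & k' = 0 \text{ or } b = 0 \\[4pt]
\Bigl( u_\sim(k', i-1, b)\,\dfrac{r_{i-1}}{p^E(t_{i-1,\sim})} + u_\succ(k'-1, i-1, b-1) + u_\sim(k'-1, i-1, b-1) \Bigr)\, p^E(t_{i,\sim})
 & 1 \le k' \le m,\ 1 \le b \le b_{\max} \text{ and } p^E(t_{i-1,\sim}) > 0 \\[4pt]
\Bigl( u_\succ(k', i-1, b-1)\,\dfrac{r_{i-1}}{p^E(t_{i-1,\succ})} + u_\succ(k'-1, i-1, b-1) + u_\sim(k'-1, i-1, b-1) \Bigr)\, p^E(t_{i,\sim})
 & 1 \le k' \le m,\ 1 \le b \le b_{\max} \text{ and } p^E(t_{i-1,\sim}) = 0
\end{cases}
\tag{13}
\]

The Global-Topk probability of $t_{e_t,\sim}$ in $E^p$ under the scoring function $s^E$ can be computed by the following equation:

\[
P^{E^p}_{k,s^E}(t_{e_t,\sim}) = \sum_{b=1}^{b_{\max}} \Bigl( \sum_{k'=1}^{k} u_\sim(k', m, b) + \sum_{k'=k+1}^{k+b-1} \frac{k - (k'-b)}{b}\, u_\sim(k', m, b) \Bigr)
\tag{14}
\]

Proof. See Appendix B.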

Recall that we design Algorithm 1 based on the recursion in Proposition 1. Similarly, a DP algorithm based on the mutual recursion in Theorem 5 is available. We skip the details and instead show how the algorithm works using Example 10 below.

The time complexity of the recursion in Theorem 5 determines the complexity of the algorithm. It takes $O(b_{\max} n^2)$ for one tuple, and $O(m_{\max} n^3)$ for computing all $n$ tuples. Recall that $m_{\max}$ is the maximal number of tying tuples in $R$, and thus $b_{\max} \le m_{\max}$. Again, the priority queue takes $O(n \log k)$. Altogether, the algorithm takes $O(m_{\max} n^3)$ time.

The space complexity of this algorithm is $O(b_{\max} n^2)$ in a straightforward implementation and $O(b_{\max} n)$ if space optimized as in Section 4.1.

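Example 10. When evaluating a top-2 query in $R^p = \langle R, p, \mathcal{C}\rangle$, consider a tuple $t \in R$ and its induced event relation $E^p = \langle E, p^E, \mathcal{C}^E\rangle$:

$E_\succ$ | $t_{e_{C_1},\succ}$ ($t_1$) | $t_{e_{C_2},\succ}$ ($t_3$) | $t_{e_{C_3},\succ}$ ($t_5$) | $t_{e_t,\succ}$ ($t_7$)
$p^E$     | 0.6                        | 0.5                        | 0.2                        | 0
$E_\sim$  | $t_{e_{C_1},\sim}$ ($t_2$)  | $t_{e_{C_2},\sim}$ ($t_4$)  | $t_{e_{C_3},\sim}$ ($t_6$)  | $t_{e_t,\sim}$ ($t_8$)
$p^E$     | 0                          | 0.25                       | 0.6                        | 0.4
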

In order to compute the Global-Topk probability of $t_8$ (i.e., $t_{e_t,\sim}$) in $E^p$, Theorem 5 leads to the following DP tables, each for a distinct combination of a value of $b$ and a secondary induced relation, where $b_{\max} = 3$. Tables (a)-(d) are for $E^p_\succ$ with $b = 0, \ldots, 3$ and tables (e)-(h) are for $E^p_\sim$ with $b = 0, \ldots, 3$; the entries of tables (a)-(e) are not reproduced here.

(f) ($b = 1$, $E^p_\sim$)

k\t | t2 | t4   | t6   | t8
1   | 0  | 0.1  | 0.06 | 0.008
2   | 0  | 0.15 | 0.21 | 0.036
3   | 0  | 0    | 0.18 | 0.052
4   | 0  | 0    | 0    | 0.024

(g) ($b = 2$, $E^p_\sim$)

k\t | t2 | t4 | t6   | t8
1   | 0  | 0  | 0    | 0
2   | 0  | 0  | 0.06 | 0.032
3   | 0  | 0  | 0.09 | 0.104
4   | 0  | 0  | 0    | 0.084

(h) ($b = 3$, $E^p_\sim$)

k\t | t2 | t4 | t6 | t8
1   | 0  | 0  | 0  | 0
2   | 0  | 0  | 0  | 0
3   | 0  | 0  | 0  | 0.024
4   | 0  | 0  | 0  | 0.036

Fig. 3. Mutual Recursion in Example 10



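The computation of each entry follows the mutual recursion in Theorem 5, for example,

\[
\begin{aligned}
u_\succ(2, 5, 0) &= \Bigl( u_\succ(1, 3, 0) + u_\sim(1, 4, 0) + u_\succ(2, 3, 0)\,\frac{1 - p^E(t_3) - p^E(t_4)}{p^E(t_3)} \Bigr)\, p^E(t_5) \\
 &= \Bigl( 0.2 + 0 + 0.3 \cdot \frac{1 - 0.5 - 0.25}{0.5} \Bigr) \cdot 0.2 = 0.07 \\
u_\sim(2, 6, 1) &= \Bigl( u_\succ(1, 3, 0) + u_\sim(1, 4, 0) + u_\sim(2, 4, 1)\,\frac{1 - p^E(t_3) - p^E(t_4)}{p^E(t_4)} \Bigr)\, p^E(t_6) \\
 &= \Bigl( 0.2 + 0 + 0.15 \cdot \frac{1 - 0.5 - 0.25}{0.25} \Bigr) \cdot 0.6 = 0.21
\end{aligned}
\]

Finally, under the scoring function $s^E$ defined in Proposition 5,

\[
\begin{aligned}
P^{E^p}_{2,s^E}(t_8)
 &= \sum_{b=1}^{3} \Bigl( \sum_{k'=1}^{2} u_\sim(k', 8, b) + \sum_{k'=2+1}^{2+b-1} \frac{2 - (k'-b)}{b}\, u_\sim(k', 8, b) \Bigr) \\
 &= u_\sim(1,8,1) + u_\sim(2,8,1) \\
 &\quad + u_\sim(1,8,2) + u_\sim(2,8,2) + \tfrac{1}{2} u_\sim(3,8,2) \\
 &\quad + u_\sim(1,8,3) + u_\sim(2,8,3) + \tfrac{2}{3} u_\sim(3,8,3) + \tfrac{1}{3} u_\sim(4,8,3) \\
 &= 0.008 + 0.036 + 0 + 0.032 + \tfrac{1}{2} \cdot 0.104 + 0 + 0 + \tfrac{2}{3} \cdot 0.024 + \tfrac{1}{3} \cdot 0.036 \\
 &= 0.156
\end{aligned}
\]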
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Bold entries in Figure 3 are involved in the above equation. 
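
The mutual recursion of Theorem 5 can be sketched as a small dynamic program and checked against Example 10. This is an illustrative sketch under assumptions, not the authors' implementation: the function name and the input format (one `(p_better, p_tie)` pair per part, with the pair of $t$ last) are hypothetical, the tables are stored in full rather than space optimized, and parts are indexed $1, \ldots, m$ as in Theorem 5 instead of with the subscripts $t_1, \ldots, t_8$ used in the example.

```python
def event_topk_prob(parts, k):
    """Global-Topk probability of the last tying event tuple, using the
    recursion on u_> (u_gt) and u_~ (u_eq) and the final summation.

    parts[i] = (p_better, p_tie) for part i + 1; at least one of the
    two probabilities must be positive for every part.
    """
    m = len(parts)
    b_max = sum(1 for _, p_tie in parts if p_tie > 0)
    # tables indexed [k'][i][b]; row/column 0 stays zero, and one spare
    # b column keeps the b + 1 lookup in range (it always holds zero,
    # which covers the b = b_max case)
    u_gt = [[[0.0] * (b_max + 2) for _ in range(m + 1)] for _ in range(m + 1)]
    u_eq = [[[0.0] * (b_max + 2) for _ in range(m + 1)] for _ in range(m + 1)]
    u_gt[1][1][0] = parts[0][0]
    u_eq[1][1][1] = parts[0][1]
    for i in range(2, m + 1):
        pg_prev, pe_prev = parts[i - 2]
        pg, pe = parts[i - 1]
        assert pg_prev > 0 or pe_prev > 0
        rest = 1.0 - pg_prev - pe_prev     # part i-1 contributes no tuple
        for kp in range(1, m + 1):
            for b in range(0, b_max + 1):
                # non-tying tuple of part i is the k'-th tuple, b ties so far
                if pg_prev > 0:
                    skip = u_gt[kp][i - 1][b] * rest / pg_prev
                else:
                    skip = u_eq[kp][i - 1][b + 1] * rest / pe_prev
                u_gt[kp][i][b] = (skip + u_gt[kp - 1][i - 1][b]
                                  + u_eq[kp - 1][i - 1][b]) * pg
                if b == 0:
                    continue               # a tying tuple implies b >= 1
                if pe_prev > 0:
                    skip = u_eq[kp][i - 1][b] * rest / pe_prev
                else:
                    skip = u_gt[kp][i - 1][b - 1] * rest / pg_prev
                u_eq[kp][i][b] = (skip + u_gt[kp - 1][i - 1][b - 1]
                                  + u_eq[kp - 1][i - 1][b - 1]) * pe
    total = 0.0
    for b in range(1, b_max + 1):
        for kp in range(1, min(k + b - 1, m) + 1):
            # full share if the tuple is within the first k, otherwise the
            # k - (k' - b) remaining slots are split among b tying tuples
            share = 1.0 if kp <= k else (k - (kp - b)) / b
            total += share * u_eq[kp][m][b]
    return total

example_10 = [(0.6, 0.0), (0.5, 0.25), (0.2, 0.6), (0.0, 0.4)]
result = event_topk_prob(example_10, 2)
```

With the probabilities of Example 10 and $k = 2$, `result` agrees with the value 0.156 obtained above up to floating-point rounding.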
6 Experiments 

We report here an empirical study on various optimization techniques proposed in Section 4.2 and Section 4.4, as the behavior of the straightforward implementation of our algorithms is largely predicted by the aforementioned theoretical analysis. We implement all the algorithms in C++ and run experiments on a machine with an Intel Core2 1.66G CPU running Cygwin on Windows XP with 1GB memory.

Each synthetic dataset has a uniform random score distribution and a uniform random probability distribution. There is no correlation between the score and the probability. The size (n) of the dataset varies from 5K up to 1M. In a dataset of a general probabilistic relation, x is the percentage of exclusive tuples and s is the max number of exclusive tuples in a part from the partition. In other words, in a general probabilistic relation of size n, there are [nx] tuples involved in a non-trivial part from the partition. The size of each part is a random number from [2, s]. Unless otherwise stated, x defaults to 0.1 and s defaults to 20. The default value of k in a top-k query is 100.

For simple relations, the baseline algorithm Basic is the space optimized version of Algorithm 1 and 2 mentioned in Section 4.1. TA integrates the TA optimization technique in Section 4.2. For general relations, the baseline algorithm Reduction is a straightforward implementation of Algorithm 3 and 4. Rollback and RollbackSort implement the two optimization techniques in Section 4.4 respectively.

Summary of experiments. We draw the following conclusions from the forthcoming experimental results:

• Optimizations such as TA, Rollback and RollbackSort are effective and significantly reduce the running time. On average, TA saves about half of the computation cost in simple relations. Compared to Reduction, Rollback and RollbackSort improve the running time by up to 2 and 3 orders of magnitude respectively.

• Decreasing the percentage of exclusive tuples (x) improves the running time of Rollback and RollbackSort. When x is fixed, increasing the max number of tuples in each part (s) improves the running time of Rollback and RollbackSort.

• For general probabilistic relations, RollbackSort scales well to large datasets.

6.1 Performance of Optimizations 

Figure 4(a) illustrates the improvement of TA over Basic for simple probabilistic relations. While Basic is already linear in terms of n, TA still saves a significant amount of computation, i.e., a little less than half. It is worth emphasizing that there is no correlation between the score and the probability in our datasets. It is well-known that the TA optimization has a better performance when there is a positive correlation between attributes, and a worse performance when there is a negative correlation between attributes. Therefore, the dataset we show, i.e., with no correlation, should represent an average case.

Fig. 4. Performance of Optimizations: (a) Simple Prob. Relation, (b) General Prob. Relation (x-axis: Tuples (n))

Fig. 5. Sensitivity to Parameters

For general probabilistic relations, Figure 4(b) illustrates the performance of Reduction, Rollback and RollbackSort when n varies from 5K to 100K. For the baseline algorithm Reduction, we show only the first three data points, as the rest are off the chart. The curve of Reduction reflects the quadratic theoretical bound. From Figure 4(b), it is clear that the heuristics Rollback and RollbackSort greatly reduce the running time over the quadratic bound. The improvement is up to 2 and 3 orders of magnitude for Rollback and RollbackSort respectively.

6.2 Sensitivity to Parameters 

Our second set of experiments studies the influence of various parameters on Rollback and RollbackSort. The results are shown in Figure 5. Notice the difference between the scale of the y-axis of Figure 5(a) (resp. Figure 5(c)) and that of Figure 5(b) (resp. Figure 5(d)). RollbackSort outperforms Rollback by one order of magnitude.

Figure 5(a) and 5(b) show the impact of varying the percentage of exclusive tuples (x) in the dataset. It is to be expected that with the increase of the percentage of exclusive tuples, more rollback operations are needed in both Rollback and RollbackSort. However, Rollback shows a linear increase, while RollbackSort shows a trend more than linear but less than quadratic.

Figure 5(c) and 5(d) illustrate the impact of the size of the parts in the partition. In these two sets of experiments, we fix the total number of exclusive tuples, and vary the max size of a part (s). A large s suggests fewer but relatively larger parts in the partition, as compared to a small s. For both Rollback and RollbackSort, we see a similar trend that as s increases, the running time decreases. The relative decrease in Rollback is larger than that of RollbackSort, which can be explained by the fact that RollbackSort is already optimized for repetitive occurrences of tuples from the same part, and therefore it should be less sensitive to the size of parts.

6.3 Scalability 



Fig. 6. Scalability of RollbackSort: (a) Running time vs n, (b) Running time vs k

As we have already seen analytically in Section 4.1 and empirically in Figure 4(a), the algorithm for simple probabilistic relations scales linearly to large datasets. TA can further improve the performance.

For general probabilistic databases, Figure 6 shows that RollbackSort scales well to large datasets. Figure 6(a) illustrates the running time of RollbackSort when n increases to 1M tuples. The trend is more than linear, but much slower than quadratic. Figure 6(b) shows the impact of k on the running time. Notice that the general trend in Figure 6(b) is linear, except that there is a "step-up" when k is about 500. We conjecture that this is due to the non-linear maintenance cost of the priority queue used in the algorithm.

7 Conclusion 

We study the semantic and computational problems for top-k queries in probabilistic databases. We propose three postulates to categorize top-k semantics in probabilistic databases and discuss their satisfaction by the semantics in the literature. Those postulates are the first step to analyze different semantics. We do not think that a single semantics is superior or inferior to other semantics just because of postulate satisfaction. Rather, we deem that the choice of the semantics should be guided by the application. The postulates help to create a profile of each semantics. We propose a new top-k semantics, namely Global-Topk, which satisfies the postulates to a large degree. We study the computational problem of query evaluation under Global-Topk semantics for simple and general probabilistic relations when the scoring function is injective. For the former, we propose a dynamic programming algorithm and effectively optimize it with the Threshold Algorithm. For the latter, we show a polynomial reduction to the simple case, and design the Rollback and RollbackSort optimizations to speed up the computation. We conduct an empirical study to verify the effectiveness of those optimizations. Furthermore, we extend the Global-Topk semantics to general scoring functions and introduce the concept of allocation policy to handle ties in score. To the best of our knowledge, this is the first attempt to address the tie problem rigorously. Previous work either does not consider ties or uses an arbitrary tie-breaking mechanism. Advanced dynamic programming algorithms are proposed for query evaluation under general scoring functions for both simple and general probabilistic relations. We provide theoretical analysis following every algorithm proposed.

For completeness, we list in Table 2 the complexity of the best known algorithm for the semantics in the literature. Since no other work addresses general scoring functions in a systematic way, those results are restricted to injective scoring functions.



Semantics   | Simple Probabilistic DB | General Probabilistic DB
Global-Topk | $O(kn)$                 | $O(kn^2)$
PT-k        | $O(kn)$                 |
U-Topk      | $O(n \log k)$           | $O(n \log k)$
U-kRanks    | $O(kn)$                 | $O(kn^2)$

Table 2. Time Complexity of Different Semantics
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8 Future Work 

Several variants of the existing semantics have been proposed in the literature [22]; their postulate satisfaction deserves further study. So far, the research reported in the literature has primarily focused on independent and exclusive relationships among tuples [21, 22, 24, 27]. It will be interesting to investigate other complex relationships between tuples. Other possible directions include top-k evaluation in other uncertain database models proposed in the literature [13] and more general preference queries in probabilistic databases.
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10 Appendix A: Semantic Postulates 



Semantics    | Exact k | Faithfulness | Stability
Global-Topk* | ✓ (1)   | ✓/✗ (5)      | ✓ (9)
PT-k         | ✗ (2)   | ✓/✗ (6)      | ✓ (10)
U-Topk       | ✗ (3)   | ✓/✗ (7)      | ✓ (11)
U-kRanks     | ✗ (4)   | ✗ (8)        | ✗ (12)

* Postulates of Global-Topk semantics are proved under general scoring functions with Equal allocation policy.

Table 3. Postulate Satisfaction for Different Semantics in Table 1

The following proofs correspond to the numbers next to each entry in the above table. Assume that we are given a probabilistic relation $R^p = \langle R, p, \mathcal{C}\rangle$, a non-negative integer $k$ and an injective scoring function $s$.

10.1 Exact k

(1) Global-Topk satisfies Exact k.

We compute the Global-Topk probability for each tuple in $R$. If there are at least $k$ tuples in $R$, we are always able to pick the $k$ tuples with the highest Global-Topk probability. In case there are more than $k - r + 1$ tuple(s) with the $r$th highest Global-Topk probability, where $r = 1, 2, \ldots, k$, only $k - r + 1$ of them will be picked nondeterministically.

(2) PT-k violates Exact k.

Example 4 illustrates a counterexample in a simple probabilistic relation.
(3) U-Topk violates Exact k.

Example 4 illustrates a counterexample in a simple probabilistic relation.

(4) U-kRanks violates Exact k.

Example 4 illustrates a counterexample in a simple probabilistic relation.



10.2 Faithfulness 

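(5) Global-Topk satisfies Faithfulness in simple probabilistic relations while it violates Faithfulness in general probabilistic relations.
(5a) Simple Probabilistic Relations

By the assumption, $t_1 \succ_s t_2$ and $p(t_1) > p(t_2)$, so we need to show that $P_{k,s}(t_1) > P_{k,s}(t_2)$.

For every $W \in \mathit{pwd}(R^p)$ such that $t_2 \in \mathit{all}_{k,s}(W)$ and $t_1 \notin \mathit{all}_{k,s}(W)$, obviously $t_1 \notin W$. Otherwise, since $t_1 \succ_s t_2$, $t_1$ would be in $\mathit{all}_{k,s}(W)$. Since all tuples are independent, there is always a world $W' \in \mathit{pwd}(R^p)$, $W' = (W \setminus \{t_2\}) \cup \{t_1\}$ and $\Pr(W') = \Pr(W)\,\frac{p(t_1)(1 - p(t_2))}{(1 - p(t_1))\,p(t_2)}$. Since $p(t_1) > p(t_2)$, $\Pr(W') > \Pr(W)$. Moreover, $t_1$ will substitute for $t_2$ in the top-k answer set to $W'$. It is easy to see that $\alpha(t_1, W') = 1$ in $W'$ and also in any world $W$ such that both $t_1$ and $t_2$ are in $\mathit{all}_{k,s}(W)$, $\alpha(t_1, W) = 1$.
Therefore, for the Global-Topk probability of $t_1$ and $t_2$, we have

\[
\begin{aligned}
P_{k,s}(t_2)
 &= \sum_{\substack{W \in \mathit{pwd}(R^p) \\ t_1 \in \mathit{all}_{k,s}(W) \\ t_2 \in \mathit{all}_{k,s}(W)}} \alpha(t_2, W)\Pr(W)
  + \sum_{\substack{W \in \mathit{pwd}(R^p) \\ t_1 \notin \mathit{all}_{k,s}(W) \\ t_2 \in \mathit{all}_{k,s}(W)}} \alpha(t_2, W)\Pr(W) \\
 &< \sum_{\substack{W \in \mathit{pwd}(R^p) \\ t_1 \in \mathit{all}_{k,s}(W) \\ t_2 \in \mathit{all}_{k,s}(W)}} \Pr(W)
  + \sum_{\substack{W' \in \mathit{pwd}(R^p) \\ t_1 \in \mathit{all}_{k,s}(W') \\ t_2 \notin W'}} \Pr(W') \\
 &= \sum_{\substack{W \in \mathit{pwd}(R^p) \\ t_1 \in \mathit{all}_{k,s}(W) \\ t_2 \in \mathit{all}_{k,s}(W)}} \alpha(t_1, W)\Pr(W)
  + \sum_{\substack{W' \in \mathit{pwd}(R^p) \\ t_1 \in \mathit{all}_{k,s}(W') \\ t_2 \notin W'}} \alpha(t_1, W')\Pr(W') \\
 &\le \sum_{\substack{W \in \mathit{pwd}(R^p) \\ t_1 \in \mathit{all}_{k,s}(W) \\ t_2 \in \mathit{all}_{k,s}(W)}} \alpha(t_1, W)\Pr(W)
  + \sum_{\substack{W' \in \mathit{pwd}(R^p) \\ t_1 \in \mathit{all}_{k,s}(W') \\ t_2 \notin W'}} \alpha(t_1, W')\Pr(W')
  + \sum_{\substack{W'' \in \mathit{pwd}(R^p) \\ t_1 \in \mathit{all}_{k,s}(W'') \\ t_2 \in W'',\ t_2 \notin \mathit{all}_{k,s}(W'')}} \alpha(t_1, W'')\Pr(W'') \\
 &= P_{k,s}(t_1).
\end{aligned}
\]

The equality in $\le$ holds when $s(t_2)$ is among the $k$ highest scores and there are at most $k$ tuples (including $t_2$) with higher or equal scores. Since there is at least one strict inequality in the above derivation, we have

\[
P_{k,s}(t_1) > P_{k,s}(t_2).
\]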

(5b) General Probabilistic Relations
The following is a counterexample.

Say $k = 1$, $R = \{t_1, \ldots, t_9\}$, $t_1 \succ_s \cdots \succ_s t_9$, $\{t_1, \ldots, t_7, t_9\}$ are exclusive. $p(t_i) = 0.1$, $i = 1 \ldots 7$, $p(t_8) = 0.4$, $p(t_9) = 0.3$.

By Global-Topk, the top-1 answer is $\{t_9\}$, while $t_8 \succ_s t_9$ and $p(t_8) > p(t_9)$, which violates Faithfulness.
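
The counterexample can be verified by enumerating the possible worlds of the general relation. The sketch below is illustrative and rests on assumptions: `global_topk_exclusive` is a hypothetical helper, tuples are identified by 0-based indices in decreasing score order (so $t_8$ is index 7 and $t_9$ is index 8), and the scoring function is injective, so no allocation policy is needed.

```python
from itertools import product

def global_topk_exclusive(parts, probs, k):
    """Global-Topk probabilities in a general probabilistic relation
    under an injective scoring function, by enumerating possible worlds.

    parts : list of lists of tuple indices; tuples within one part are
            mutually exclusive, different parts are independent
    probs : tuple probabilities, lower index = higher score
    """
    result = [0.0] * len(probs)
    # every part contributes exactly one of its tuples, or none (None)
    choices = [part + [None] for part in parts]
    for pick in product(*choices):
        pr = 1.0
        for part, chosen in zip(parts, pick):
            if chosen is None:
                pr *= max(0.0, 1.0 - sum(probs[u] for u in part))
            else:
                pr *= probs[chosen]
        world = sorted(u for u in pick if u is not None)
        for t in world[:k]:               # the k best tuples of this world
            result[t] += pr
    return result

parts = [[0, 1, 2, 3, 4, 5, 6, 8], [7]]   # {t_1..t_7, t_9} exclusive, t_8 alone
probs = [0.1] * 7 + [0.4, 0.3]
top1 = global_topk_exclusive(parts, probs, 1)
```

Here $t_8$ is the top-1 tuple only when none of $t_1, \ldots, t_7$ exists, giving $0.4 \cdot 0.3 = 0.12$, while $t_9$ excludes $t_1, \ldots, t_7$ by itself and only needs $t_8$ to be absent, giving $0.3 \cdot 0.6 = 0.18$.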

(6) PT-k satisfies Faithfulness in simple probabilistic relations while it violates Faithfulness in general probabilistic relations.

For simple probabilistic relations, we can use the same proof as in (5) to show that PT-k satisfies Faithfulness. The only change would be that we need to show $P_{k,s}(t_1) > p_\tau$ as well. Since $P_{k,s}(t_2) \ge p_\tau$ and $P_{k,s}(t_1) > P_{k,s}(t_2)$, this is obviously true. For general probabilistic relations, we can use the same counterexample as in (5) and set the threshold $p_\tau = 0.15$.

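(7) U-Topk satisfies Faithfulness in simple probabilistic relations while it violates Faithfulness in general probabilistic relations.

(7a) Simple Probabilistic Relations

By contradiction. If U-Topk violates Faithfulness in a simple probabilistic relation, there exist $R^p = \langle R, p, \mathcal{C}\rangle$ and $t_i, t_j \in R$, $t_i \succ_s t_j$, $p(t_i) > p(t_j)$, and by U-Topk, $t_j$ is in the top-k answer set to $R^p$ under the scoring function $s$ while $t_i$ is not.

$S$ is a top-k answer set to $R^p$ under the function $s$ by the U-Topk semantics, $t_j \in S$ and $t_i \notin S$. Denote by $Q_{k,s}(S)$ the probability of $S$ under the U-Topk semantics. That is,

\[
Q_{k,s}(S) = \sum_{\substack{W \in \mathit{pwd}(R^p) \\ S = \mathit{top}_{k,s}(W)}} \Pr(W).
\]

For any world $W$ contributing to $Q_{k,s}(S)$, $t_i \notin W$. Otherwise, since $t_i \succ_s t_j$, $t_i$ would be in $\mathit{top}_{k,s}(W)$, which is $S$. Define a world $W' = (W \setminus \{t_j\}) \cup \{t_i\}$. Since $t_i$ is independent of any other tuple in $R$, $W' \in \mathit{pwd}(R^p)$ and $\Pr(W') = \Pr(W)\,\frac{p(t_i)(1 - p(t_j))}{(1 - p(t_i))\,p(t_j)}$. Moreover, $\mathit{top}_{k,s}(W') = (S \setminus \{t_j\}) \cup \{t_i\}$. Let $S' = (S \setminus \{t_j\}) \cup \{t_i\}$, then $W'$ contributes to $Q_{k,s}(S')$.

\[
\begin{aligned}
Q_{k,s}(S')
 &= \sum_{\substack{W \in \mathit{pwd}(R^p) \\ S' = \mathit{top}_{k,s}(W)}} \Pr(W) \\
 &\ge \sum_{\substack{W \in \mathit{pwd}(R^p) \\ S = \mathit{top}_{k,s}(W)}} \Pr\bigl((W \setminus \{t_j\}) \cup \{t_i\}\bigr) \\
 &= \sum_{\substack{W \in \mathit{pwd}(R^p) \\ S = \mathit{top}_{k,s}(W)}} \Pr(W)\,\frac{p(t_i)(1 - p(t_j))}{(1 - p(t_i))\,p(t_j)} \\
 &= \frac{p(t_i)(1 - p(t_j))}{(1 - p(t_i))\,p(t_j)}\, Q_{k,s}(S) \\
 &> Q_{k,s}(S),
\end{aligned}
\]

which is a contradiction.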
(7b) General Probabilistic Relations

The following is a counterexample.

Say $k = 2$, $R = \{t_1, t_2, t_3, t_4\}$, $t_1 \succ_s t_2 \succ_s t_3 \succ_s t_4$, $t_1$ and $t_2$ are
exclusive, $t_3$ and $t_4$ are exclusive. $p(t_1) = 0.5$, $p(t_2) = 0.45$, $p(t_3) = 0.4$, $p(t_4) = 0.3$.
By U-Topk, the top-2 answer is $\{t_1, t_3\}$, while $t_2 \succ_s t_3$ and $p(t_2) > p(t_3)$,
which violates Faithfulness.
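The counterexample in (7b) can be checked by enumerating possible worlds. The following is a minimal sketch, not the paper's algorithm: the helper names `possible_worlds` and `u_topk` and the dictionary encoding of exclusive parts are assumptions made for illustration.

```python
from itertools import product

def possible_worlds(parts):
    """Yield (world, probability) pairs.

    `parts` is a list of dicts {tuple_name: probability}. Tuples in the same
    part are mutually exclusive; different parts are independent.
    """
    choices = []
    for part in parts:
        none_prob = 1.0 - sum(part.values())
        options = list(part.items())
        if none_prob > 1e-12:          # the part may contribute no tuple
            options.append((None, none_prob))
        choices.append(options)
    for combo in product(*choices):
        world = frozenset(name for name, _ in combo if name is not None)
        prob = 1.0
        for _, pr in combo:
            prob *= pr
        yield world, prob

def u_topk(parts, order, k):
    """Map each candidate top-k set to the total probability of the worlds
    whose top-k set it is. `order` lists tuples by decreasing score."""
    dist = {}
    for world, prob in possible_worlds(parts):
        top = frozenset([t for t in order if t in world][:k])
        dist[top] = dist.get(top, 0.0) + prob
    return dist

# Data of the counterexample in (7b).
parts = [{"t1": 0.5, "t2": 0.45}, {"t3": 0.4, "t4": 0.3}]
order = ["t1", "t2", "t3", "t4"]
dist = u_topk(parts, order, 2)
best = max(dist, key=dist.get)
print(sorted(best), round(dist[best], 6))
```

Under these assumptions the most probable top-2 set is {t1, t3} with probability 0.2, ahead of {t2, t3} with 0.18, which matches the claim that t3 is returned although t2 dominates it in both score and probability.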
(8) U-kRanks violates Faithfulness.

The following is a counterexample.

Say $k = 2$, $R^p$ is simple. $R = \{t_1, t_2, t_3\}$, $t_1 \succ_s t_2 \succ_s t_3$, $p(t_1) = 0.48$,
$p(t_2) = 0.8$, $p(t_3) = 0.78$.

The probabilities of each tuple at each rank are as follows:

              t1       t2       t3
    rank 1    0.48     0.416    0.08112
    rank 2    0        0.384    0.39936
    rank 3    0        0        0.29952

By U-kRanks, the top-2 answer set is $\{t_1, t_3\}$ while $t_2 \succ_s t_3$ and $p(t_2) > p(t_3)$,
which contradicts Faithfulness.
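The rank table in (8) can be reproduced by brute force. The sketch below assumes a simple relation with an injective scoring function, with tuples listed by decreasing score; the function names are illustrative and not taken from the paper.

```python
from itertools import product

def rank_probabilities(probs):
    """table[r][i] = probability that tuple i is ranked (r + 1)-th in a world.

    probs[i] is the probability of the tuple with the (i + 1)-th highest score.
    Every subset of tuples is a possible world because tuples are independent.
    """
    n = len(probs)
    table = [[0.0] * n for _ in range(n)]
    for present in product([True, False], repeat=n):
        pr = 1.0
        for flag, p in zip(present, probs):
            pr *= p if flag else 1.0 - p
        rank = 0
        for i, flag in enumerate(present):
            if flag:
                table[rank][i] += pr
                rank += 1
    return table

def u_kranks(probs, k):
    """For each rank 1..k, return the index of the most probable tuple."""
    table = rank_probabilities(probs)
    return [max(range(len(probs)), key=lambda i: table[r][i]) for r in range(k)]

probs = [0.48, 0.8, 0.78]          # p(t1), p(t2), p(t3)
table = rank_probabilities(probs)
winners = u_kranks(probs, 2)       # indices into (t1, t2, t3)
```

With the probabilities of (8), the winners are the indices of t1 (rank 1) and t3 (rank 2), so t2 is left out even though it beats t3 on both score and probability.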



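10.3 Stability

(9) Global-Topk satisfies Stability.

In the rest of this proof, let $A$ be the set of all winners under the Global-Topk
semantics.

Part I: Probability.
Case 1: Winners.

For any winner $t \in A$, if we only raise the probability of $t$, we have a new
probabilistic relation $(R^p)' = (R, p', C)$, where the new probability function
$p'$ is such that $p'(t) > p(t)$ and for any $t' \in R$, $t' \neq t$, $p'(t') = p(t')$. Note
that $pwd(R^p) = pwd((R^p)')$. In addition, assume $t \in C_t$, where $C_t \in C$. By
Global-Topk,

$$P^{R^p}_{k,s}(t) = \sum_{\substack{W \in pwd(R^p) \\ t \in all_{k,s}(W)}} \alpha(t, W) Pr(W)$$

and

$$P^{(R^p)'}_{k,s}(t) = \sum_{\substack{W \in pwd(R^p) \\ t \in all_{k,s}(W)}} \alpha(t, W) Pr(W) \frac{p'(t)}{p(t)} = \frac{p'(t)}{p(t)} P^{R^p}_{k,s}(t).$$

For any other tuple $t' \in R$, $t' \neq t$, we have the following:

$$
\begin{aligned}
P^{(R^p)'}_{k,s}(t') ={}& \sum_{\substack{W \in pwd(R^p) \\ t' \in all_{k,s}(W),\ t \in W}} \alpha(t', W) Pr(W) \frac{p'(t)}{p(t)} \\
&+ \sum_{\substack{W \in pwd(R^p) \\ t' \in all_{k,s}(W),\ t \notin W \\ (C_t \setminus \{t\}) \cap W = \emptyset}} \alpha(t', W) Pr(W) \frac{c - p'(t)}{c - p(t)} \\
&+ \sum_{\substack{W \in pwd(R^p) \\ t' \in all_{k,s}(W),\ t \notin W \\ (C_t \setminus \{t\}) \cap W \neq \emptyset}} \alpha(t', W) Pr(W) \\
\leq{}& \frac{p'(t)}{p(t)} \Big( \sum_{\substack{W \in pwd(R^p) \\ t' \in all_{k,s}(W),\ t \in W}} \alpha(t', W) Pr(W)
 + \sum_{\substack{W \in pwd(R^p) \\ t' \in all_{k,s}(W),\ t \notin W \\ (C_t \setminus \{t\}) \cap W = \emptyset}} \alpha(t', W) Pr(W)
 + \sum_{\substack{W \in pwd(R^p) \\ t' \in all_{k,s}(W),\ t \notin W \\ (C_t \setminus \{t\}) \cap W \neq \emptyset}} \alpha(t', W) Pr(W) \Big) \\
={}& \frac{p'(t)}{p(t)} P^{R^p}_{k,s}(t'),
\end{aligned}
$$

where $c = 1 - \sum_{t'' \in C_t \setminus \{t\}} p(t'')$.

Now we can see that $t$'s Global-Topk probability in $(R^p)'$ is raised to exactly
$\frac{p'(t)}{p(t)}$ times that in $R^p$ under the same weak order scoring function $s$,
and for any tuple other than $t$, its Global-Topk probability in $(R^p)'$ can be raised
to at most $\frac{p'(t)}{p(t)}$ times that in $R^p$ under the same scoring function $s$.
As a result, $P^{(R^p)'}_{k,s}(t)$ is still among the highest $k$ Global-Topk probabilities
in $(R^p)'$ under the function $s$, and therefore $t$ is still a winner.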

Case 2: Losers.

This case is similar to Case 1.

Part II: Score.

Case 1: Winners.

For any winner $t \in A$, we evaluate $R^p$ under a new general scoring function
$s'$. Compared to $s$, $s'$ only raises the score of $t$. That is, $s'(t) > s(t)$ and
for any $t' \in R$, $t' \neq t$, $s'(t') = s(t')$. Then, in addition to all the worlds
already totally (i.e., $\alpha(t, W) = 1$) or partially (i.e., $\alpha(t, W) < 1$) contributing
to $t$'s Global-Topk probability when evaluating $R^p$ under $s$, some other worlds
may now totally or partially contribute to $t$'s Global-Topk probability, because,
under the function $s'$, $t$ might climb high enough to be in the top-k answer set
of those worlds. Moreover, if a possible world $W$ contributes partially under
scoring function $s$, it is easy to see that it contributes totally under scoring
function $s'$.

For any tuple $t''$ other than $t$ in $R$,

(i) If $s(t'') \neq s(t)$, then its Global-Topk probability under the function $s'$
either stays the same (if the "climbing" of $t$ does not knock that tuple out
of the top-k answer set in some possible world) or decreases (otherwise);

(ii) If $s(t'') = s(t)$, then for any possible world $W$ contributing to $t''$'s
Global-Topk probability under scoring function $s$, $\alpha(t'', W) = \frac{k-a}{b}$, where
$a$ and $b$ are the numbers of tuples in $W$ scoring higher than and equal to $t''$
under $s$, and now under scoring function $s'$,
$\alpha'(t'', W) = \frac{k-a-1}{b-1} < \frac{k-a}{b} = \alpha(t'', W)$. Therefore the
Global-Topk probability of $t''$ under scoring function $s'$ is less than that under
scoring function $s$.

Consequently, $t$ is still a winner when evaluating $R^p$ under the function $s'$.

Case 2: Losers.

This case is similar to Case 1.
(10) PT-k satisfies Stability.

In the rest of this proof, let $A$ be the set of all winners under the PT-k semantics.

Part I: Probability.
Case 1: Winners.

For any winner $t \in A$, if we only raise the probability of $t$, we have a new
probabilistic relation $(R^p)' = (R, p', C)$, where the new probability function
$p'$ is such that $p'(t) > p(t)$ and for any $t' \in R$, $t' \neq t$, $p'(t') = p(t')$. Note
that $pwd(R^p) = pwd((R^p)')$. In addition, assume $t \in C_t$, where $C_t \in C$. The
Global-Topk probability of $t$ is such that

$$P^{R^p}_{k,s}(t) = \sum_{\substack{W \in pwd(R^p) \\ t \in top_{k,s}(W)}} Pr(W) > p_\tau$$

and

$$P^{(R^p)'}_{k,s}(t) = \frac{p'(t)}{p(t)} P^{R^p}_{k,s}(t) > P^{R^p}_{k,s}(t).$$

Therefore, $P^{(R^p)'}_{k,s}(t)$ is still above the threshold $p_\tau$, and $t$ still belongs to the
top-k answer set of $(R^p)'$ under the function $s$.

Case 2: Losers.

This case is similar to Case 1.

Part II: Score.
Case 1: Winners.

For any winner $t \in A$, we evaluate $R^p$ under a new scoring function $s'$. Compared
to $s$, $s'$ only raises the score of $t$. Using an argument similar to that in
(9) Part II Case 1 but under injective scoring functions, we can show that the
Global-Topk probability of $t$ is non-decreasing and is still above the threshold
$p_\tau$. Therefore, tuple $t$ still belongs to the top-k answer set under the function
$s'$.

Case 2: Losers.

This case is similar to Case 1.
(11) U-Topk satisfies Stability.

In the rest of this proof, let $A$ be the set of all winners under U-Topk semantics.

Part I: Probability.
Case 1: Winners.

For any winner $t \in A$, if we only raise the probability of $t$, we have a new
probabilistic relation $(R^p)' = (R, p', C)$, where the new probability function
$p'$ is such that $p'(t) > p(t)$ and for any $t' \in R$, $t' \neq t$, $p'(t') = p(t')$. In
the following discussion, we use a prime superscript to indicate the probability in the
context of $(R^p)'$. Note that $pwd(R^p) = pwd((R^p)')$.

Recall that $Q_{k,s}(A_t)$ is the probability of a top-k answer set $A_t \subseteq A$ under
U-Topk semantics, where $t \in A_t$. Since $t \in A_t$, $Q'_{k,s}(A_t) = Q_{k,s}(A_t)\frac{p'(t)}{p(t)}$.

Consider any candidate top-k answer set $B$ other than $A_t$, i.e., $\exists W \in pwd(R^p)$,
$top_{k,s}(W) = B$ and $B \neq A_t$. By definition,

$$Q_{k,s}(B) \leq Q_{k,s}(A_t).$$

For any world $W$ contributing to $Q_{k,s}(B)$, its probability either increases
$\frac{p'(t)}{p(t)}$ times (if $t \in W$), or stays the same (if $t \notin W$ and $\exists t' \in W$ such
that $t'$ and $t$ are exclusive), or decreases (otherwise). Therefore,

$$Q'_{k,s}(B) \leq Q_{k,s}(B)\frac{p'(t)}{p(t)}.$$

Altogether,

$$Q'_{k,s}(B) \leq Q_{k,s}(B)\frac{p'(t)}{p(t)} \leq Q_{k,s}(A_t)\frac{p'(t)}{p(t)} = Q'_{k,s}(A_t).$$

Therefore, $A_t$ is still a top-k answer set to $(R^p)'$ under the function $s$ and
$t \in A_t$ is still a winner.
Case 2: Losers.

It is more complicated in the case of losers. We need to show that for any loser
$t$, if we decrease its probability, no top-k candidate answer set $B_t$ containing $t$
will be a new top-k answer set under the U-Topk semantics. The procedure is
similar to that in Case 1, except that when we analyze the new probability of
any original top-k answer set $A_i$, we need to differentiate between two cases:

(a) $t$ is exclusive with some tuple in $A_i$;

(b) $t$ is independent of all the tuples in $A_i$.

It is easier with (a), where all the worlds contributing to the probability of
$A_i$ do not contain $t$. In (b), some worlds contributing to the probability of $A_i$
contain $t$, while others do not, and we calculate the new probability for those
two kinds of worlds differently. As we will see shortly, the probability of $A_i$
stays unchanged in either (a) or (b).

For any loser $t \in R$, $t \notin A$, by applying the technique used in Case 1, we have
a new probabilistic relation $(R^p)' = (R, p', C)$, where the new probability
function $p'$ is such that $p'(t) < p(t)$ and for any $t' \in R$, $t' \neq t$, $p'(t') = p(t')$.
Again, $pwd(R^p) = pwd((R^p)')$.

For any top-k answer set $A_i$ to $R^p$ under the function $s$, $A_i \subseteq A$. Denote by
$S_{A_i}$ all the possible worlds contributing to $Q_{k,s}(A_i)$. Based on the membership
of $t$, $S_{A_i}$ is partitioned into two subsets $S^{t}_{A_i}$ and $S^{\bar{t}}_{A_i}$:

$$S_{A_i} = \{W \mid W \in pwd(R^p),\ top_{k,s}(W) = A_i\};$$

$$S_{A_i} = S^{t}_{A_i} \cup S^{\bar{t}}_{A_i}, \quad S^{t}_{A_i} \cap S^{\bar{t}}_{A_i} = \emptyset,$$

$$\forall W \in S^{t}_{A_i},\ t \in W \quad \text{and} \quad \forall W \in S^{\bar{t}}_{A_i},\ t \notin W.$$

If $t$ is exclusive with some tuple in $A_i$, $S^{t}_{A_i} = \emptyset$. In this case, any world
$W \in S^{\bar{t}}_{A_i}$ contains one of $t$'s exclusive tuples, therefore $W$'s probability will not
be affected by the change in $t$'s probability. In this case,

$$Q'_{k,s}(A_i) = \sum_{W \in S^{\bar{t}}_{A_i}} Pr(W) = Q_{k,s}(A_i).$$

Otherwise, $t$ is independent of all the tuples in $A_i$. In this case,

$$\frac{\sum_{W \in pwd(R^p),\ W \in S^{t}_{A_i}} Pr(W)}{\sum_{W \in pwd(R^p),\ W \in S^{\bar{t}}_{A_i}} Pr(W)} = \frac{p(t)}{1 - p(t)}$$

and

$$
\begin{aligned}
Q'_{k,s}(A_i) &= \sum_{\substack{W \in pwd(R^p) \\ W \in S^{t}_{A_i}}} Pr(W)\frac{p'(t)}{p(t)} + \sum_{\substack{W \in pwd(R^p) \\ W \in S^{\bar{t}}_{A_i}}} Pr(W)\frac{1 - p'(t)}{1 - p(t)} \\
&= \sum_{\substack{W \in pwd(R^p) \\ W \in S_{A_i}}} Pr(W) \\
&= Q_{k,s}(A_i).
\end{aligned}
$$

We can see that in both cases, $Q'_{k,s}(A_i) = Q_{k,s}(A_i)$.

Now for any top-k candidate answer set containing $t$, say $B_t$ such that $B_t \not\subseteq A$,
by definition, $Q_{k,s}(B_t) < Q_{k,s}(A_i)$. Moreover,

$$Q'_{k,s}(B_t) = Q_{k,s}(B_t)\frac{p'(t)}{p(t)} < Q_{k,s}(B_t).$$

Therefore,

$$Q'_{k,s}(B_t) < Q_{k,s}(B_t) < Q_{k,s}(A_i) = Q'_{k,s}(A_i).$$

Consequently, $B_t$ is still not a top-k answer set to $(R^p)'$ under the function $s$.
Since no top-k candidate answer set containing $t$ can be a top-k answer set to
$(R^p)'$ under the function $s$, $t$ is still a loser.
Part II: Score.

Again, $A_i \subseteq A$ is a top-k answer set to $R^p$ under the function $s$ by U-Topk
semantics.

Case 1: Winners.

For any winner $t \in A_i$, we evaluate $R^p$ under a new scoring function $s'$. Compared
to $s$, $s'$ only raises the score of $t$. That is, $s'(t) > s(t)$ and for any
$t' \in R$, $t' \neq t$, $s'(t') = s(t')$. In some possible world $W \in pwd(R^p)$
with $top_{k,s}(W) \neq A_i$, $t$ might climb high enough to be in $top_{k,s'}(W)$. Define
$T$ to be the set of such top-k candidate answer sets:

$$T = \{top_{k,s'}(W) \mid W \in pwd(R^p),\ t \notin top_{k,s}(W) \wedge t \in top_{k,s'}(W)\}.$$

Only a top-k candidate set $B_j \in T$ can possibly end up with a probability
higher than that of $A_i$ across all possible worlds, and thus substitute for $A_i$ as
a new top-k answer set to $R^p$ under the function $s'$. In that case, $t \in B_j$, so $t$ is
still a winner.

Case 2: Losers.

For any loser $t \in R$, $t \notin A$. Using a technique similar to Case 1, the new
scoring function $s'$ is such that $s'(t) < s(t)$ and for any $t' \in R$, $t' \neq t$,
$s'(t') = s(t')$. When evaluating $R^p$ under the function $s'$, for any world $W \in pwd(R^p)$
such that $t \notin top_{k,s}(W)$, the score decrease of $t$ will not affect its top-k answer
set, i.e., $top_{k,s'}(W) = top_{k,s}(W)$. For any world $W \in pwd(R^p)$ such that
$t \in top_{k,s}(W)$, $t$ might go down enough to drop out of $top_{k,s'}(W)$. In this
case, $W$ will contribute its probability to a top-k candidate answer set without
$t$, instead of the original one with $t$. In other words, under the function $s'$,
compared to the evaluation under the function $s$, the probability of a top-k
candidate answer set with $t$ is non-increasing, while the probability of a top-k
candidate answer set without $t$ is non-decreasing*.

Since any top-k answer set to $R^p$ under the function $s$ does not contain $t$, it
follows from the above analysis that any top-k candidate answer set containing
$t$ will not be a top-k answer set to $R^p$ under the new function $s'$, and thus $t$ is
still a loser.
(12) U-kRanks violates Stability.

The following is a counterexample.

Say $k = 2$, $R^p$ is simple. $R = \{t_1, t_2, t_3\}$, $t_1 \succ_s t_2 \succ_s t_3$. $p(t_1) = 0.3$,
$p(t_2) = 0.4$, $p(t_3) = 0.3$.

              t1      t2      t3
    rank 1    0.3     0.28    0.126
    rank 2    0       0.12    0.138
    rank 3    0       0       0.036

By U-kRanks, the top-2 answer set is $\{t_1, t_3\}$.
Now raise the score of $t_3$ such that $t_1 \succ_{s'} t_3 \succ_{s'} t_2$.

              t1      t3      t2
    rank 1    0.3     0.21    0.196
    rank 2    0       0.09    0.168
    rank 3    0       0       0.036

By U-kRanks, the top-2 answer set is $\{t_1, t_2\}$. By raising the score of $t_3$, we
actually turn the winner $t_3$ into a loser, which contradicts Stability.
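The two tables in (12) can also be obtained without enumerating worlds: in a simple relation with an injective scoring function, a tuple is ranked (r + 1)-th exactly when it is present and exactly r higher-scored tuples are present. The sketch below uses that observation; `rank_table` and `u_kranks` are illustrative names, and the approach is one possible way to compute the table rather than the paper's procedure.

```python
def rank_table(probs):
    """table[r][i] = Pr(tuple i is ranked (r + 1)-th), tuples in decreasing score order."""
    n = len(probs)
    table = [[0.0] * n for _ in range(n)]
    count = [1.0]  # count[r] = Pr(exactly r of the tuples processed so far are present)
    for i, p in enumerate(probs):
        for r, c in enumerate(count):
            table[r][i] = c * p
        nxt = [0.0] * (len(count) + 1)
        for r, c in enumerate(count):
            nxt[r] += c * (1.0 - p)   # tuple i absent
            nxt[r + 1] += c * p       # tuple i present
        count = nxt
    return table

def u_kranks(names, probs, k):
    """Return, for each rank 1..k, the name of the most probable tuple at that rank."""
    table = rank_table(probs)
    return [names[max(range(len(names)), key=lambda i: table[r][i])]
            for r in range(k)]

# Original order t1 > t2 > t3, then the order after raising the score of t3.
before = u_kranks(["t1", "t2", "t3"], [0.3, 0.4, 0.3], 2)
after = u_kranks(["t1", "t3", "t2"], [0.3, 0.3, 0.4], 2)
print(before, after)
```

Under these assumptions `before` contains t1 and t3 while `after` contains t1 and t2, reproducing the loss of t3 after its score is raised.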



11 Appendix B: Proofs 
11.1 Proof for Proposition 1 

Proposition 1. Given a simple probabilistic relation $R^p = (R, p, C)$ and an injective
scoring function $s$ over $R^p$, if $R = \{t_1, t_2, \ldots, t_n\}$ and $t_1 \succ_s t_2 \succ_s \cdots \succ_s t_n$, the
following recursion on Global-Topk queries holds.

$$
q(k, i) =
\begin{cases}
0 & k = 0 \\
p(t_i) & 1 \leq i \leq k \\
\left(q(k, i-1)\frac{\bar{p}(t_{i-1})}{p(t_{i-1})} + q(k-1, i-1)\right) p(t_i) & \text{otherwise}
\end{cases}
$$

where $q(k, i) = P_{k,s}(t_i)$ and $\bar{p}(t_{i-1}) = 1 - p(t_{i-1})$.

* Here, any subset of $R$ with cardinality at most $k$ that is not a top-k candidate answer set under
the function $s$ is conceptually regarded as a top-k candidate answer set with probability zero
under the function $s$.
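Proof. By induction on $k$ and $i$.

- Base case.

  • $k = 0$

    For any $W \in pwd(R^p)$, $top_{0,s}(W) = \emptyset$. Therefore, for any $t_i \in R$, the
    Global-Topk probability of $t_i$ is 0.

  • $k > 0$ and $i = 1$

    $t_1$ has the highest score among all tuples in $R$. As long as tuple $t_1$ appears in a
    possible world $W$, it will be in $top_{k,s}(W)$. So the Global-Topk probability
    of $t_1$ is the probability that $t_1$ appears in possible worlds, i.e., $q(k, 1) = p(t_1)$.

- Inductive step.

  Assume the proposition holds for $0 \leq k \leq k_0$ and $1 \leq i \leq i_0$. For any $W \in pwd(R^p)$,
  $t_{i_0} \in top_{k_0,s}(W)$ iff $t_{i_0} \in W$ and there are at most $k_0 - 1$ tuples with a higher
  score in $W$. Note that any tuple with a score lower than the score of $t_{i_0}$ does not have
  any influence on $q(k_0, i_0)$, because its presence/absence in a possible world will
  not affect the presence of $t_{i_0}$ in the top-k answer set of that world.
  Since all the tuples are independent,

  $$q(k_0, i_0) = p(t_{i_0}) \sum_{\substack{W \in pwd(R^p) \\ |\{t \mid t \in W \wedge t \succ_s t_{i_0}\}| < k_0}} Pr(W).$$

  (1) $q(k_0, i_0 + 1)$ is the Global-Top$k_0$ probability of tuple $t_{i_0+1}$.

  $$
  \begin{aligned}
  q(k_0, i_0 + 1) ={}& \sum_{\substack{W \in pwd(R^p) \\ t_{i_0+1} \in top_{k_0,s}(W) \\ t_{i_0} \in top_{k_0,s}(W)}} Pr(W)
  + \sum_{\substack{W \in pwd(R^p) \\ t_{i_0+1} \in top_{k_0,s}(W) \\ t_{i_0} \in W,\ t_{i_0} \notin top_{k_0,s}(W)}} Pr(W) \\
  &+ \sum_{\substack{W \in pwd(R^p) \\ t_{i_0+1} \in top_{k_0,s}(W) \\ t_{i_0} \notin W}} Pr(W).
  \end{aligned}
  $$

  For the first part of the right-hand side, $t_{i_0}$ and $t_{i_0+1}$ are both in $top_{k_0,s}(W)$
  exactly when $t_{i_0+1} \in W$ and $t_{i_0} \in top_{k_0-1,s}(W)$, so

  $$\sum_{\substack{W \in pwd(R^p) \\ t_{i_0+1} \in top_{k_0,s}(W) \\ t_{i_0} \in top_{k_0-1,s}(W)}} Pr(W) = p(t_{i_0+1})\, q(k_0 - 1, i_0).$$

  The second part is zero. Since $t_{i_0} \succ_s t_{i_0+1}$, if $t_{i_0+1} \in top_{k_0,s}(W)$ and
  $t_{i_0} \in W$, then $t_{i_0} \in top_{k_0,s}(W)$.

  The third part is the sum of the probabilities of all possible worlds such that
  $t_{i_0+1} \in W$, $t_{i_0} \notin W$ and there are at most $k_0 - 1$ tuples with score higher than
  the score of $t_{i_0}$ in $W$. So it is equivalent to

  $$p(t_{i_0+1})\, \bar{p}(t_{i_0}) \sum_{\substack{W \in pwd(R^p) \\ |\{t \mid t \in W \wedge t \succ_s t_{i_0}\}| < k_0}} Pr(W) = p(t_{i_0+1})\, \bar{p}(t_{i_0}) \frac{q(k_0, i_0)}{p(t_{i_0})}.$$

  Altogether, we have

  $$
  \begin{aligned}
  q(k_0, i_0 + 1) &= p(t_{i_0+1})\, q(k_0 - 1, i_0) + p(t_{i_0+1})\, \bar{p}(t_{i_0}) \frac{q(k_0, i_0)}{p(t_{i_0})} \\
  &= \left(q(k_0 - 1, i_0) + q(k_0, i_0)\frac{\bar{p}(t_{i_0})}{p(t_{i_0})}\right) p(t_{i_0+1}).
  \end{aligned}
  $$

  (2) $q(k_0 + 1, i_0)$ is the Global-Top$(k_0 + 1)$ probability of tuple $t_{i_0}$. Using an
  argument similar to the above, it can be shown that this case is correctly computed by
  Equation (3) as well.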
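The recursion of Proposition 1 translates directly into a dynamic program. The following sketch is illustrative only: it assumes a simple relation with an injective scoring function and strictly positive probabilities (the recursion divides by $p(t_{i-1})$), and the brute-force checker `global_topk_brute` is a hypothetical helper added to cross-check the result.

```python
from itertools import product

def global_topk_dp(probs, k):
    """Global-Topk probabilities via q(k, i); probs are in decreasing score order."""
    n = len(probs)
    # q[kk][i] with 1-based i; row 0 (k = 0) stays all zeros.
    q = [[0.0] * (n + 1) for _ in range(k + 1)]
    for kk in range(1, k + 1):
        for i in range(1, n + 1):
            p_i = probs[i - 1]
            if i <= kk:
                q[kk][i] = p_i
            else:
                p_prev = probs[i - 2]
                q[kk][i] = (q[kk][i - 1] * (1.0 - p_prev) / p_prev
                            + q[kk - 1][i - 1]) * p_i
    return q[k][1:]

def global_topk_brute(probs, k):
    """Same quantity by enumerating all 2^n possible worlds."""
    n = len(probs)
    out = [0.0] * n
    for present in product([True, False], repeat=n):
        pr = 1.0
        for flag, p in zip(present, probs):
            pr *= p if flag else 1.0 - p
        for i in [j for j, flag in enumerate(present) if flag][:k]:
            out[i] += pr
    return out

probs = [0.48, 0.8, 0.78]
dp = global_topk_dp(probs, 2)
brute = global_topk_brute(probs, 2)
```

For the relation used in (8), both routines give 0.48, 0.8 and 0.48048 for k = 2. The table has (k + 1) x (n + 1) entries, each filled in constant time.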



11.2 Proof for Theorem 2 

Theorem 2 (Correctness of Algorithm 1). Given a simple probabilistic relation
$R^p = (R, p, C)$, a non-negative integer $k$ and an injective scoring function $s$ over $R^p$,
the above TA-based algorithm correctly finds a Global-Topk top-k answer set.

Proof. In every iteration of Step (2), say $\underline{t} = t_i$, for any unseen tuple $t$, let $s'$ be an
injective scoring function over $R^p$ which only differs from $s$ in the score of $t$. Under the
function $s'$, $t_i \succ_{s'} t \succ_{s'} t_{i+1}$. If we evaluate the top-k query in $R^p$ under $s'$ instead
of $s$, $P_{k,s'}(t) = \frac{p(t)}{\underline{p}} UP$. On the other hand, for any $W \in pwd(R^p)$, $W$ contributing to
$P_{k,s}(t)$ implies that $W$ contributes to $P_{k,s'}(t)$, while the reverse is not necessarily true.
So, we have $P_{k,s'}(t) \geq P_{k,s}(t)$. Recall that $\underline{p} \geq p(t)$, therefore
$UP \geq \frac{p(t)}{\underline{p}} UP = P_{k,s'}(t) \geq P_{k,s}(t)$. The conclusion follows from the
correctness of the original TA algorithm and Algorithm 1.



11.3 Proof for Lemma 1 

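Lemma 1. Let $R^p = (R, p, C)$ be a probabilistic relation, $s$ an injective scoring function,
$t \in R$, and $E^p = (E, p^E, C^E)$ the event relation induced by $t$. Define
$Q^p = (E - \{t_{e_t}\}, p^E, C^E - \{\{t_{e_t}\}\})$. Then, the Global-Topk probability of $t$ satisfies the
following:

$$P^{R^p}_{k,s}(t) = p(t) \cdot \sum_{\substack{W_e \in pwd(Q^p) \\ |W_e| < k}} Pr(W_e).$$

Proof. Given $t \in R$, $k$ and $s$, let $A$ be a subset of $pwd(R^p)$ such that
$W \in A \Leftrightarrow t \in top_{k,s}(W)$. If we group all the possible worlds in $A$ by the set of parts
whose tuple in $W$ has a higher score than the score of $t$, then we have the following partition:

$$A = A_1 \cup A_2 \cup \ldots \cup A_q, \quad A_i \cap A_j = \emptyset,\ i \neq j$$

and

$$\forall A_i,\ \forall W_1, W_2 \in A_i,\ i = 1, 2, \ldots, q,$$

$$\{C_j \mid \exists t' \in W_1 \cap C_j,\ t' \succ_s t\} = \{C_j \mid \exists t' \in W_2 \cap C_j,\ t' \succ_s t\}.$$

Moreover, denote by $CharParts(A_i)$ $A_i$'s characteristic set of parts.

Now, let $B$ be a subset of $pwd(Q^p)$ such that $W_e \in B \Leftrightarrow |W_e| < k$. There is a
bijection $g : \{A_i \mid A_i \subseteq A\} \to B$, mapping each part $A_i$ in $A$ to the possible world in $B$
which contains only tuples corresponding to the parts in $A_i$'s characteristic set:

$$g(A_i) = \{t_{e_{C_j}} \mid C_j \in CharParts(A_i)\}.$$

The following equation holds from the definition of an induced event relation and
Proposition 2.

$$
\begin{aligned}
\sum_{W \in A_i} Pr(W) &= p(t) \prod_{C_j \in CharParts(A_i)} p(t_{e_{C_j}}) \prod_{\substack{C_j \in C - \{C_{id(t)}\} \\ C_j \notin CharParts(A_i)}} (1 - p(t_{e_{C_j}})) \\
&= p(t)\, Pr(g(A_i)).
\end{aligned}
$$

Therefore,

$$
\begin{aligned}
P^{R^p}_{k,s}(t) &= \sum_{W \in A} Pr(W) = \sum_{i=1}^{q} \sum_{W \in A_i} Pr(W) \\
&= \sum_{i=1}^{q} p(t)\, Pr(g(A_i)) = p(t) \sum_{i=1}^{q} Pr(g(A_i)) \\
&= p(t) \sum_{W_e \in B} Pr(W_e) \\
&= p(t) \Big( \sum_{\substack{W_e \in pwd(Q^p) \\ |W_e| < k}} Pr(W_e) \Big).
\end{aligned}
$$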
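Lemma 1 suggests a simple computation: collapse every part other than the one containing $t$ into a single independent event whose probability is the total probability of that part's tuples scoring higher than $t$, then multiply $p(t)$ by the probability that fewer than $k$ such events occur. The sketch below follows that reading under an injective scoring function; the data layout and function names are assumptions for illustration, and the brute-force routine is included only as a cross-check.

```python
from itertools import product

def global_topk_via_events(parts, score, t, k):
    """p(t) * Pr(fewer than k induced events occur), as in Lemma 1."""
    p_t = next(part[t] for part in parts if t in part)
    event_probs = []
    for part in parts:
        if t in part:
            continue                      # t's own part induces no event here
        higher = sum(p for name, p in part.items() if score[name] > score[t])
        if higher > 0:
            event_probs.append(higher)
    count = [1.0]                         # count[c] = Pr(exactly c events occur)
    for e in event_probs:
        nxt = [0.0] * (len(count) + 1)
        for c, pr in enumerate(count):
            nxt[c] += pr * (1.0 - e)
            nxt[c + 1] += pr * e
        count = nxt
    return p_t * sum(count[:k])

def global_topk_brute(parts, score, t, k):
    """Sum Pr(W) over all worlds W whose top-k tuples include t."""
    choices = []
    for part in parts:
        options = list(part.items())
        rest = 1.0 - sum(part.values())
        if rest > 1e-12:
            options.append((None, rest))
        choices.append(options)
    total = 0.0
    for combo in product(*choices):
        world = [name for name, _ in combo if name is not None]
        pr = 1.0
        for _, p in combo:
            pr *= p
        top = sorted(world, key=lambda name: -score[name])[:k]
        if t in top:
            total += pr
    return total

parts = [{"t1": 0.5, "t2": 0.45}, {"t3": 0.4, "t4": 0.3}, {"t5": 0.6, "t6": 0.2}]
score = {"t1": 6, "t5": 5, "t2": 4, "t3": 3, "t6": 2, "t4": 1}
fast = global_topk_via_events(parts, score, "t3", 2)
slow = global_topk_brute(parts, score, "t3", 2)
```

In this hypothetical three-part relation, the two routines agree for every tuple and every k, which is what the lemma predicts.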



11.4 Proof for Proposition 3 

Proposition 3 (Correctness of Algorithm 4). Given a probabilistic relation $R^p = (R, p, C)$
and an injective scoring function $s$, for any $t \in R^p$, the Global-Topk probability
of $t$ equals the Global-Topk probability of $t_{e_t}$ when evaluating top-k in the induced
event relation $E^p = (E, p^E, C^E)$ under the injective scoring function $s^E : E \to \mathbb{R}$,
$s^E(t_{e_t}) = \frac{1}{2}$ and $s^E(t_{e_{C_i}}) = i$:

$$P^{R^p}_{k,s}(t) = P^{E^p}_{k,s^E}(t_{e_t}).$$

Proof. Since $t_{e_t}$ has the lowest score under $s^E$, for any $W_e \in pwd(E^p)$, the only
chance that $t_{e_t} \in top_{k,s^E}(W_e)$ is when there are at most $k$ tuples in $W_e$, including $t_{e_t}$.

$$\forall W_e \in pwd(E^p), \quad t_{e_t} \in top_{k,s^E}(W_e) \Leftrightarrow (t_{e_t} \in W_e \wedge |W_e| \leq k).$$

Therefore,

$$P^{E^p}_{k,s^E}(t_{e_t}) = \sum_{t_{e_t} \in W_e \wedge |W_e| \leq k} Pr(W_e).$$

In the proof of Lemma 1, $B$ contains all the possible worlds having at most $k - 1$
tuples from $E - \{t_{e_t}\}$. By Proposition 2,

$$\sum_{t_{e_t} \in W_e \wedge |W_e| \leq k} Pr(W_e) = p(t) \sum_{W'_e \in B} Pr(W'_e).$$

By Lemma 1,

$$p(t) \sum_{W'_e \in B} Pr(W'_e) = P^{R^p}_{k,s}(t).$$

Consequently,

$$P^{R^p}_{k,s}(t) = P^{E^p}_{k,s^E}(t_{e_t}).$$

11.5 Proof for Proposition 4 



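Proposition 4 (Correctness of Algorithm 5). Let $R^p = (R, p, C)$ be a simple probabilistic
relation where $R = \{t_1, \ldots, t_n\}$, $t_1 \succeq_s t_2 \succeq_s \cdots \succeq_s t_n$, $k$ a non-negative
integer and $s$ a scoring function. For every $t_i \in R$, the Global-Topk probability of $t_i$
can be computed by the following equation:

$$P^{R^p}_{k,s}(t_i) = \sum_{k'=0}^{k-1} T_{k',[i_\succ]} \cdot P^{R^p_s(t_i)}_{k-k',s}(t_i)$$

where $R^p_s(t_i)$ is $R^p$ restricted to $\{t \in R \mid t \sim_s t_i\}$.

Proof. Given a tuple $t_i \in R$, let $R_\theta$ be the support relation $R$ restricted to
$\{t \in R \mid t\ \theta\ t_i\}$, and $R^p_\theta$ be $R^p$ restricted to $R_\theta$, where
$\theta \in \{\succ, \sim, \prec, \preceq\}$ (subscript $s$ omitted). Similarly,
for each possible world $W \in pwd(R^p)$, $W_\theta = W \cap R_\theta$.
Each possible world $W \in pwd(R^p)$ such that $t_i \in all_{k,s}(W)$ contributes
$\min(1, \frac{k-a}{b}) Pr(W)$ to $P^{R^p}_{k,s}(t_i)$, where $a = |W_\succ|$ and $b = |W_\sim|$.

$$
\begin{aligned}
P^{R^p}_{k,s}(t_i) &= \sum_{\substack{W \in pwd(R^p),\ t_i \in W \\ |W_\succ| = a,\ 0 \leq a \leq k-1 \\ |W_\sim| = b,\ 1 \leq b \leq m}} \min\Big(1, \frac{k-a}{b}\Big) Pr(W) \\
&= \sum_{a=0}^{k-1} \sum_{b=1}^{m} \min\Big(1, \frac{k-a}{b}\Big) \sum_{\substack{W \in pwd(R^p),\ t_i \in W \\ |W_\succ| = a,\ |W_\sim| = b}} Pr(W) \\
&= \sum_{a=0}^{k-1} \sum_{b=1}^{m} \min\Big(1, \frac{k-a}{b}\Big) \Big( \sum_{\substack{W_\succ \in pwd(R^p_\succ) \\ |W_\succ| = a}} Pr(W_\succ) \sum_{\substack{W_\preceq \in pwd(R^p_\preceq),\ t_i \in W_\preceq \\ |W_\sim| = b}} Pr(W_\preceq) \Big) \\
&= \sum_{a=0}^{k-1} \Big( \sum_{\substack{W_\succ \in pwd(R^p_\succ) \\ |W_\succ| = a}} Pr(W_\succ) \sum_{b=1}^{m} \min\Big(1, \frac{k-a}{b}\Big) \Big( \sum_{\substack{W_\preceq \in pwd(R^p_\preceq),\ t_i \in W_\preceq \\ |W_\sim| = b}} Pr(W_\preceq) \Big) \Big) \\
&= \sum_{a=0}^{k-1} \Big( T_{a,[i_\succ]} \sum_{b=1}^{m} \min\Big(1, \frac{k-a}{b}\Big) \Big( \sum_{\substack{W_\sim \in pwd(R^p_\sim),\ t_i \in W_\sim \\ |W_\sim| = b}} Pr(W_\sim) \sum_{W_\prec \in pwd(R^p_\prec)} Pr(W_\prec) \Big) \Big) \\
&= \sum_{a=0}^{k-1} \Big( T_{a,[i_\succ]} \sum_{b=1}^{m} \min\Big(1, \frac{k-a}{b}\Big) \Big( \sum_{\substack{W_\sim \in pwd(R^p_\sim),\ t_i \in W_\sim \\ |W_\sim| = b}} Pr(W_\sim) \Big) \Big) \\
&= \sum_{a=0}^{k-1} T_{a,[i_\succ]} \cdot P^{R^p_s(t_i)}_{k-a,s}(t_i)
\end{aligned}
$$

where $m$ is the number of tuples tying with $t_i$ (inclusive), i.e., $m = |R^p_s(t_i)|$.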
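The decomposition in the proof can be exercised numerically: the number of present higher-scored tuples and the number of present tying tuples are independent, and a world with $a$ higher-scored and $b$ tying tuples contributes with coefficient $\min(1, (k-a)/b)$. The sketch below is an illustration under the assumption of a simple relation with possibly tied scores; `count_dist`, `global_topk_ties` and `global_topk_ties_brute` are hypothetical helper names.

```python
from itertools import product

def count_dist(probs):
    """dist[c] = Pr(exactly c of the given independent tuples are present)."""
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for c, pr in enumerate(dist):
            nxt[c] += pr * (1.0 - p)
            nxt[c + 1] += pr * p
        dist = nxt
    return dist

def global_topk_ties(probs, scores, i, k):
    """Global-Topk probability of tuple i when scores may tie (simple relation)."""
    higher = [p for p, s in zip(probs, scores) if s > scores[i]]
    tied_others = [p for j, (p, s) in enumerate(zip(probs, scores))
                   if s == scores[i] and j != i]
    t_higher = count_dist(higher)      # distribution of a
    t_tied = count_dist(tied_others)   # distribution of b - 1 (t_i itself is present)
    total = 0.0
    for a in range(min(k, len(t_higher))):
        for others, pr_b in enumerate(t_tied):
            b = others + 1
            total += t_higher[a] * min(1.0, (k - a) / b) * probs[i] * pr_b
    return total

def global_topk_ties_brute(probs, scores, i, k):
    """Same quantity by enumerating all possible worlds."""
    total = 0.0
    for present in product([True, False], repeat=len(probs)):
        if not present[i]:
            continue
        pr = 1.0
        for flag, p in zip(present, probs):
            pr *= p if flag else 1.0 - p
        a = sum(1 for f, s in zip(present, scores) if f and s > scores[i])
        b = sum(1 for f, s in zip(present, scores) if f and s == scores[i])
        if a < k:
            total += min(1.0, (k - a) / b) * pr
    return total

probs = [0.6, 0.5, 0.7, 0.4, 0.9]
scores = [3, 2, 2, 2, 1]
fast = [global_topk_ties(probs, scores, i, 2) for i in range(5)]
slow = [global_topk_ties_brute(probs, scores, i, 2) for i in range(5)]
```

In this example the three middle tuples tie, so worlds in which more tying tuples are present than remaining top-2 slots split those slots evenly among them; the two routines agree on all five tuples.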



11.6 Proof for Proposition 5 

Proposition 5. Given a probabilistic relation $R^p = (R, p, C)$ and a scoring function $s$,
for any $t \in R^p$, the Global-Topk probability of $t$ equals the Global-Topk probability of
$t_{e_t,\sim}$ when evaluating top-k in the induced event relation $E^p = (E, p^E, C^E)$ under the
scoring function : E ^ R, s^{tet,y-) = \, s-^(iet,~) = \, s^{tec.,~) = \ and 

Pk,s (*) = ■fj^sE(*et,~)- 



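Proof. As in the proof of Lemma 1, we construct a bijection.

Given $t \in R$, $k$ and $s$, let $A$ be a subset of $pwd(R^p)$ such that
$W \in A \Leftrightarrow t \in all_{k,s}(W)$. If we group all the possible worlds in $A$ by the set of
parts whose tuple in $W$ has a score higher than or equal to that of $t$, then we have the
following partition:

$$A = A_1 \cup A_2 \cup \ldots \cup A_q, \quad A_i \cap A_j = \emptyset,\ i \neq j$$

and

$$\forall A_i,\ \forall W_1, W_2 \in A_i,\ i = 1, 2, \ldots, q,$$

$$\{C_{j,\succ} \mid \exists t' \in W_1 \cap C_j,\ t' \succ_s t\} = \{C_{j,\succ} \mid \exists t' \in W_2 \cap C_j,\ t' \succ_s t\}$$

and

$$\{C_{j,\sim} \mid \exists t' \in W_1 \cap C_j,\ t' \sim_s t\} = \{C_{j,\sim} \mid \exists t' \in W_2 \cap C_j,\ t' \sim_s t\}.$$

Moreover, denote by $CharParts(A_i)$ $A_i$'s characteristic set of parts. Note that all
$W \in A_i$ have the same allocation coefficient $\alpha(t, W)$, denoted by $\alpha_i$.

Now, let $B$ be a subset of $pwd(E^p)$ such that $W_e \in B \Leftrightarrow t_{e_t,\sim} \in all_{k,s^E}(W_e)$.
There is a bijection $g : \{A_i \mid A_i \subseteq A\} \to B$, mapping each part $A_i$ in $A$ to the possible
world in $B$ which contains only tuples corresponding to parts in $A_i$'s characteristic set:

$$g(A_i) = \{t_{e_{C_j},\succ} \mid C_{j,\succ} \in CharParts(A_i)\} \cup \{t_{e_{C_j},\sim} \mid C_{j,\sim} \in CharParts(A_i)\}.$$

Furthermore, the allocation coefficient $\alpha_i$ of $A_i$ equals the allocation coefficient
$\alpha(t_{e_t,\sim}, g(A_i))$ under the function $s^E$.

The following equation holds from the definition of an induced event relation under
general scoring functions.

$$
\begin{aligned}
\sum_{W \in A_i} Pr(W) &= \prod_{C_{j,\succ} \in CharParts(A_i)} p(t_{e_{C_j},\succ}) \prod_{C_{j,\sim} \in CharParts(A_i)} p(t_{e_{C_j},\sim}) \prod_{\substack{C_{j,\succ} \notin CharParts(A_i) \\ C_{j,\sim} \notin CharParts(A_i)}} \big(1 - p(t_{e_{C_j},\succ}) - p(t_{e_{C_j},\sim})\big) \\
&= Pr(g(A_i)).
\end{aligned}
$$

Therefore,

$$
\begin{aligned}
P^{R^p}_{k,s}(t) &= \sum_{W \in A} \alpha(t, W) Pr(W) = \sum_{i=1}^{q} \alpha_i \sum_{W \in A_i} Pr(W) \\
&= \sum_{i=1}^{q} \alpha_i\, Pr(g(A_i)) = \sum_{i=1}^{q} \alpha(t_{e_t,\sim}, g(A_i))\, Pr(g(A_i)) \\
&= \sum_{W_e \in B} \alpha(t_{e_t,\sim}, W_e)\, Pr(W_e) \quad (g \text{ is a bijection}) \\
&= P^{E^p}_{k,s^E}(t_{e_t,\sim}).
\end{aligned}
$$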

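11.7 Proof for Theorem 5

Theorem 5. Given a probabilistic relation $R^p = (R, p, C)$, a scoring function $s$, $t \in R^p$,
and its induced event relation $E^p = (E, p^E, C^E)$, where $|E| = 2m$, the following
recursion on $u_\succ(k', i, b)$ and $u_\sim(k', i, b)$ holds, where $b_{max}$ is the number of tuples
with a positive probability in $E_\sim$.

When $i = 1$, $0 \leq k' \leq m$ and $0 \leq b \leq b_{max}$,

$$
u_\succ(k', 1, b) =
\begin{cases}
p^E(t_{1,\succ}) & k' = 1,\ b = 0 \\
0 & \text{otherwise}
\end{cases}
$$

$$
u_\sim(k', 1, b) =
\begin{cases}
p^E(t_{1,\sim}) & k' = 1,\ b = 1 \\
0 & \text{otherwise}
\end{cases}
$$

For every $i$, $2 \leq i \leq m$, $0 \leq k' \leq m$ and $0 \leq b \leq b_{max}$,

$$
u_\succ(k', i, b) =
\begin{cases}
0 & k' = 0 \\[4pt]
\Big( u_\succ(k', i-1, b)\dfrac{1 - p^E(t_{i-1,\succ}) - p^E(t_{i-1,\sim})}{p^E(t_{i-1,\succ})} + u_\succ(k'-1, i-1, b) + u_\sim(k'-1, i-1, b) \Big)\, p^E(t_{i,\succ}) & 1 \leq k' \leq m,\ p^E(t_{i-1,\succ}) > 0 \\[8pt]
\Big( u_\sim(k', i-1, b+1)\dfrac{1 - p^E(t_{i-1,\succ}) - p^E(t_{i-1,\sim})}{p^E(t_{i-1,\sim})} + u_\succ(k'-1, i-1, b) + u_\sim(k'-1, i-1, b) \Big)\, p^E(t_{i,\succ}) & 1 \leq k' \leq m,\ p^E(t_{i-1,\succ}) = 0 \text{ and } 0 \leq b < b_{max} \\[8pt]
\big( u_\succ(k'-1, i-1, b) + u_\sim(k'-1, i-1, b) \big)\, p^E(t_{i,\succ}) & 1 \leq k' \leq m,\ p^E(t_{i-1,\succ}) = 0 \text{ and } b = b_{max}
\end{cases}
\qquad (12)
$$

$$
u_\sim(k', i, b) =
\begin{cases}
0 & k' = 0 \text{ or } b = 0 \\[4pt]
\Big( u_\sim(k', i-1, b)\dfrac{1 - p^E(t_{i-1,\succ}) - p^E(t_{i-1,\sim})}{p^E(t_{i-1,\sim})} + u_\succ(k'-1, i-1, b-1) + u_\sim(k'-1, i-1, b-1) \Big)\, p^E(t_{i,\sim}) & 1 \leq k' \leq m,\ 1 \leq b \leq b_{max} \text{ and } p^E(t_{i-1,\sim}) > 0 \\[8pt]
\Big( u_\succ(k', i-1, b-1)\dfrac{1 - p^E(t_{i-1,\succ}) - p^E(t_{i-1,\sim})}{p^E(t_{i-1,\succ})} + u_\succ(k'-1, i-1, b-1) + u_\sim(k'-1, i-1, b-1) \Big)\, p^E(t_{i,\sim}) & 1 \leq k' \leq m,\ 1 \leq b \leq b_{max} \text{ and } p^E(t_{i-1,\sim}) = 0
\end{cases}
\qquad (13)
$$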
The Global-Topk probability of $t_{e_t,\sim}$ in $E^p$ under the scoring function $s^E$ can be
computed by the following equation:



6=1 k' = l 



fe'=fe+l 



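Proof. Equation (14) follows from Equation (12) and Equation (13), as it is a simple
enumeration based on Definition 8. We prove Equation (12) and Equation (13) by
induction on $i$.

- Base case: $i = 1$, $0 \leq k' \leq m$ and $0 \leq b \leq b_{max}$.

  When $i = 1$, based on the definition of $u$, the only non-zero entries are $u_\succ(1, 1, 0)$
  and $u_\sim(1, 1, 1)$. The former is the probability sum of all possible worlds which
  contain $t_{1,\succ}$ and do not contain $t_{1,\sim}$. The second requirement is redundant since those
  two tuples are exclusive. Therefore, it is simply the probability of $t_{1,\succ}$. Similarly,
  the latter is the probability sum of all possible worlds which contain $t_{1,\sim}$ and do
  not contain $t_{1,\succ}$. Again, it is simply the probability of $t_{1,\sim}$. It is easy to check that
  no possible worlds satisfy other combinations of $k'$ and $b$ when $i = 1$, therefore
  their probabilities are 0.

- Inductive step.

  Assume the theorem holds for $i \leq i_0$, $0 \leq k' \leq m$ and $0 \leq b \leq b_{max}$, where
  $1 \leq i_0 \leq m - 1$.

  Denote by $E_{\succ,[i]}$ and $E_{\sim,[i]}$ the set of the first $i$ tuples in $E_\succ$ and $E_\sim$ respectively.

  For any $W \in pwd(E^p)$, by definition, $W$ contributes to $u_{\succ/\sim}(k', i_0, b)$ iff
  $t_{i_0,\succ/\sim} \in W$ and $|W \cap (E_{\succ,[i_0]} \cup E_{\sim,[i_0]})| = k'$ and $|W \cap E_{\sim,[i_0]}| = b$.
  Since $E_{\succ,[i_0]} \cap E_{\sim,[i_0]} = \emptyset$, we have:

  $W$ contributes to $u_{\succ/\sim}(k', i_0, b) \Leftrightarrow t_{i_0,\succ/\sim} \in W$ and
  $|W \cap E_{\succ,[i_0]}| = k' - b$ and $|W \cap E_{\sim,[i_0]}| = b$.

  (1) $u_\succ(k', i_0 + 1, b)$ is the probability sum of all possible worlds $W$ such that
  $t_{i_0+1,\succ} \in W$, $|W \cap E_{\succ,[i_0+1]}| = k' - b$ and $|W \cap E_{\sim,[i_0+1]}| = b$.

  $$
  \begin{aligned}
  u_\succ(k', i_0 + 1, b) ={}& \sum_{\substack{W \in pwd(E^p),\ t_{i_0+1,\succ} \in W \\ |W \cap E_{\succ,[i_0+1]}| = k' - b \\ |W \cap E_{\sim,[i_0+1]}| = b}} Pr(W) \\
  ={}& \sum_{\substack{W \in pwd(E^p) \\ t_{i_0+1,\succ} \in W,\ t_{i_0,\succ} \in W \\ |W \cap E_{\succ,[i_0]}| = k' - 1 - b \\ |W \cap E_{\sim,[i_0]}| = b}} Pr(W)
  + \sum_{\substack{W \in pwd(E^p) \\ t_{i_0+1,\succ} \in W,\ t_{i_0,\sim} \in W \\ |W \cap E_{\succ,[i_0]}| = k' - 1 - b \\ |W \cap E_{\sim,[i_0]}| = b}} Pr(W) \\
  &+ \sum_{\substack{W \in pwd(E^p) \\ t_{i_0+1,\succ} \in W,\ t_{i_0,\succ} \notin W,\ t_{i_0,\sim} \notin W \\ |W \cap E_{\succ,[i_0]}| = k' - 1 - b \\ |W \cap E_{\sim,[i_0]}| = b}} Pr(W).
  \end{aligned}
  $$

  For the first part of the right-hand side,

  $$\sum_{\substack{W \in pwd(E^p) \\ t_{i_0+1,\succ} \in W,\ t_{i_0,\succ} \in W \\ |W \cap E_{\succ,[i_0]}| = k' - 1 - b \\ |W \cap E_{\sim,[i_0]}| = b}} Pr(W)
  = p(t_{i_0+1,\succ}) \sum_{\substack{W \in pwd(E^p),\ t_{i_0,\succ} \in W \\ |W \cap E_{\succ,[i_0]}| = k' - 1 - b \\ |W \cap E_{\sim,[i_0]}| = b}} Pr(W)
  = p(t_{i_0+1,\succ})\, u_\succ(k' - 1, i_0, b).$$

  For the second part of the right-hand side,

  $$\sum_{\substack{W \in pwd(E^p) \\ t_{i_0+1,\succ} \in W,\ t_{i_0,\sim} \in W \\ |W \cap E_{\succ,[i_0]}| = k' - 1 - b \\ |W \cap E_{\sim,[i_0]}| = b}} Pr(W)
  = p(t_{i_0+1,\succ}) \sum_{\substack{W \in pwd(E^p),\ t_{i_0,\sim} \in W \\ |W \cap E_{\succ,[i_0]}| = k' - 1 - b \\ |W \cap E_{\sim,[i_0]}| = b}} Pr(W)
  = p(t_{i_0+1,\succ})\, u_\sim(k' - 1, i_0, b).$$

  For the third part of the right-hand side, if $p(t_{i_0,\succ}) + p(t_{i_0,\sim}) = 1$, then there is
  no possible world satisfying this condition, therefore it is zero. Otherwise,

  $$\sum_{\substack{W \in pwd(E^p) \\ t_{i_0+1,\succ} \in W,\ t_{i_0,\succ} \notin W,\ t_{i_0,\sim} \notin W \\ |W \cap E_{\succ,[i_0]}| = k' - 1 - b \\ |W \cap E_{\sim,[i_0]}| = b}} Pr(W)
  = p(t_{i_0+1,\succ}) \sum_{\substack{W \in pwd(E^p) \\ t_{i_0,\succ} \notin W,\ t_{i_0,\sim} \notin W \\ |W \cap E_{\succ,[i_0]}| = k' - 1 - b \\ |W \cap E_{\sim,[i_0]}| = b}} Pr(W). \qquad (15)$$

  Equation (15) can be computed either by Equation (16) when $p(t_{i_0,\succ}) > 0$ or
  by Equation (17) when $p(t_{i_0,\sim}) > 0$ and $b < b_{max}$. Notice that at least one
  of $p(t_{i_0,\succ})$ and $p(t_{i_0,\sim})$ is positive, otherwise neither tuple is in the induced
  event relation according to Definition 11.

  $$\sum_{\substack{W \in pwd(E^p) \\ t_{i_0,\succ} \notin W,\ t_{i_0,\sim} \notin W \\ |W \cap E_{\succ,[i_0]}| = k' - 1 - b \\ |W \cap E_{\sim,[i_0]}| = b}} Pr(W)
  = \frac{1 - p(t_{i_0,\succ}) - p(t_{i_0,\sim})}{p(t_{i_0,\succ})} \sum_{\substack{W \in pwd(E^p),\ t_{i_0,\succ} \in W \\ |W \cap E_{\succ,[i_0]}| = k' - b \\ |W \cap E_{\sim,[i_0]}| = b}} Pr(W)
  = \frac{1 - p(t_{i_0,\succ}) - p(t_{i_0,\sim})}{p(t_{i_0,\succ})}\, u_\succ(k', i_0, b). \qquad (16)$$

  $$\sum_{\substack{W \in pwd(E^p) \\ t_{i_0,\succ} \notin W,\ t_{i_0,\sim} \notin W \\ |W \cap E_{\succ,[i_0]}| = k' - 1 - b \\ |W \cap E_{\sim,[i_0]}| = b}} Pr(W)
  = \frac{1 - p(t_{i_0,\succ}) - p(t_{i_0,\sim})}{p(t_{i_0,\sim})} \sum_{\substack{W \in pwd(E^p),\ t_{i_0,\sim} \in W \\ |W \cap E_{\succ,[i_0]}| = k' - 1 - b \\ |W \cap E_{\sim,[i_0]}| = b + 1}} Pr(W)
  = \frac{1 - p(t_{i_0,\succ}) - p(t_{i_0,\sim})}{p(t_{i_0,\sim})}\, u_\sim(k', i_0, b + 1). \qquad (17)$$

  A subtlety is that when $p(t_{i_0,\succ}) = 0$ and $b = b_{max}$, neither Equation (16) nor
  Equation (17) applies. However, in this case, one of the conditions in Equation
  (15) is that $|W \cap E_{\sim,[i_0]}| = b = b_{max}$, which implies $i_0 = m$. Otherwise, the
  world $W$ does not have enough tuples from $E_\sim$. On the other hand, we know
  that $i_0 \leq m - 1$. Therefore, there are simply no possible worlds satisfying the
  condition in Equation (15), and Equation (15) equals 0.

  Altogether, we show that this case can be correctly computed by Equation (12).

  (2) $u_\sim(k', i_0 + 1, b)$ is the probability sum of all possible worlds $W$ such that
  $t_{i_0+1,\sim} \in W$, $|W \cap E_{\succ,[i_0+1]}| = k' - b$ and $|W \cap E_{\sim,[i_0+1]}| = b$. Using an
  argument similar to the above, it can be shown that this case is correctly computed
  by Equation (13) as well.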